hanxiao committed
Commit af99331 · verified · 1 Parent(s): 1256ad3

Update README.md

Files changed (1): README.md (+82 -54)

README.md CHANGED
@@ -17,28 +17,34 @@ library_name: transformers
  <b>Trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
  </p>

- [Blog](https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-markdown-and-json) | [Colab](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing)
+ [Blog](https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-markdown-and-json) | [Colab](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing) | [AWS](https://aws.amazon.com/marketplace/pp/prodview-jwfct4j4rvxk2?sr=0-21&ref_=beagle&applicationId=AWSMPContessa)

  # ReaderLM-v2

- `ReaderLM-v2` is the second generation of [ReaderLM-v1](https://huggingface.co/jinaai/reader-lm-1.5b), a **1.5B** parameter language model that converts raw HTML into formatted markdown or structured JSON with improved accuracy and better support for longer contexts.
- Supporting multiple languages (29 in total), `ReaderLM-v2` is specialized for tasks involving HTML parsing, transformation, and text extraction.
+ `ReaderLM-v2` is a 1.5B parameter language model that converts raw HTML into beautifully formatted markdown or JSON with superior accuracy and improved long-context handling. Supporting multiple languages (29 in total), `ReaderLM-v2` is specialized for tasks involving HTML parsing, transformation, and text extraction.

  ## Model Overview

  - **Model Type**: Autoregressive, decoder-only transformer
- - **Parameter Count**: ~1.5B
+ - **Parameter Count**: 1.54B
  - **Context Window**: Up to 512K tokens (combined input and output)
+ - **Hidden Size**: 1536
+ - **Number of Layers**: 28
+ - **Query Heads**: 12
+ - **KV Heads**: 2
+ - **Head Size**: 128
+ - **Intermediate Size**: 8960
  - **Supported Languages**: English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more (29 total)

  ## What's New in `ReaderLM-v2`

- `ReaderLM-v2` features several improvements over [ReaderLM-v1](https://huggingface.co/jinaai/reader-lm-1.5b):
+ `ReaderLM-v2` represents a significant leap forward from its predecessor, with several key improvements:

- - **Better Markdown Generation**: Generates cleaner, more readable Markdown output.
- - **JSON Output**: Produce structured JSON-formatted text, enabling structured extraction for further downstream processing.
- - **Longer Context Handling**: Can handle up to 512K tokens, which is beneficial for large HTML documents.
- - **Multilingual Support**: Covers 29 languages for broader applications across international web data.
+ - **Better Markdown Generation**: Thanks to its new training paradigm and higher-quality training data, the model excels at generating complex elements like code fences, nested lists, tables, and LaTeX equations.
+ - **JSON Output**: Introduces direct HTML-to-JSON generation using predefined schemas, eliminating the need for intermediate markdown conversion.
+ - **Longer Context Handling**: Handles up to 512K tokens of combined input and output, with improved performance on long-form content.
+ - **Multilingual Support**: Comprehensive support across 29 languages for broader applications.
+ - **Enhanced Stability**: Contrastive loss during training greatly alleviates degeneration issues when generating long sequences.

  ---
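The spec list added in the hunk above maps one-to-one onto the checkpoint's published config. A minimal sketch for verifying the numbers, assuming only standard Qwen2-style config field names on the Hub checkpoint and the usual Qwen2.5 vocabulary size:

```python
from transformers import AutoConfig

# Read the published config from the Hub and check it against the list above.
config = AutoConfig.from_pretrained("jinaai/ReaderLM-v2")

assert config.hidden_size == 1536
assert config.num_hidden_layers == 28
assert config.num_attention_heads == 12   # query heads
assert config.num_key_value_heads == 2    # KV heads (grouped-query attention)
assert config.intermediate_size == 8960

# Rough parameter arithmetic (assumes the Qwen2.5 vocabulary of ~151,936 tokens):
#   embeddings ≈ 151936 * 1536                       ≈ 0.23B
#   per layer  ≈ attention (~5.5M) + MLP 3*1536*8960 (~41.3M) ≈ 47M
#   28 layers  ≈ 1.31B; total ≈ 1.54B, matching the parameter count above
```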
 
@@ -49,13 +55,9 @@ For a more hands-on experience in a hosted environment, see the [Google Colab No

  ## On Google Colab

- The easiest way to experience `ReaderLM-v2` is by running our [Colab notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing),
- The notebook demonstrates HTML-to-markdown conversion, JSON extraction, and instruction-following using the HackerNews frontpage as an example.
- The notebook is optimized for Colab's free T4 GPU tier and requires `vllm` and `triton` for acceleration and running.
- Feel free to test it with any website.
- For HTML-to-markdown tasks, simply input the raw HTML without any prefix instructions.
- However, JSON output and instruction-based extraction require specific prompt formatting as shown in the examples.
+ The easiest way to experience `ReaderLM-v2` is through our [Colab notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing), which demonstrates HTML-to-markdown conversion, JSON extraction, and instruction-following using the HackerNews frontpage as an example. The notebook is optimized for Colab's free T4 GPU tier and requires `vllm` and `triton` for acceleration.
+
+ Note that the free T4 GPU has limitations: it supports neither bfloat16 nor flash attention 2, leading to higher memory usage and slower processing of longer inputs. Nevertheless, ReaderLM-v2 still processes large documents under these constraints, achieving 67 tokens/s on input and 36 tokens/s on output. For production use, we recommend an RTX 3090/4090 for optimal performance.

  ## Local Usage
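Since the notebook relies on `vllm`, a minimal serving sketch might look like the following. The model id and sampling values mirror the transformers examples later in this README; the 32K context cap is an assumption made here to fit a 16 GB T4, not a setting visible in this diff:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
# Cap the context so the KV cache fits a 16 GB T4; the model itself allows far more.
llm = LLM(model="jinaai/ReaderLM-v2", max_model_len=32768)
params = SamplingParams(temperature=0, repetition_penalty=1.08, max_tokens=1024)

html = "<html><body><h1>Hacker News</h1><p>Example page</p></body></html>"  # placeholder
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": f"Extract the main content from the given HTML and convert it to Markdown format.\n```html\n{html}\n```"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(llm.generate([prompt], params)[0].outputs[0].text)
```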
 
@@ -78,7 +80,7 @@ To use `ReaderLM-v2` locally:
  model = AutoModelForCausalLM.from_pretrained("jinaai/ReaderLM-v2").to(device)
  ```

- 3. (Optional) Pre-clean your HTML to remove scripts, styles, comments, to reduce the noise and length of the input a bit (i.e. make it more friendly for GPU VRAM):
+ 3. (Optional) Pre-clean your HTML to remove scripts, styles, and comments to reduce the noise and length of the input:

  ```python
  # Patterns
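The pattern list itself is truncated by this hunk. For illustration only, a minimal pre-cleaning pass in the spirit of that step might look like this; the regexes below are assumptions, not the README's actual patterns:

```python
import re

# Illustrative patterns; the README's actual list is cut off above.
SCRIPT_PATTERN = r"<script[^>]*>[\s\S]*?</script>"
STYLE_PATTERN = r"<style[^>]*>[\s\S]*?</style>"
COMMENT_PATTERN = r"<!--[\s\S]*?-->"

def clean_html(html: str) -> str:
    """Strip scripts, styles, and comments to shrink the prompt before inference."""
    for pattern in (SCRIPT_PATTERN, STYLE_PATTERN, COMMENT_PATTERN):
        html = re.sub(pattern, "", html, flags=re.IGNORECASE)
    return html
```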
@@ -117,31 +119,30 @@ To use `ReaderLM-v2` locally:

  4. Create a prompt for the model:

- ```python
- def create_prompt(text: str, tokenizer=None, instruction: str = None, schema: str = None) -> str:
-     """
-     Create a prompt for the model with optional instruction and JSON schema.
-     """
-     if not instruction:
-         instruction = "Extract the main content from the given HTML and convert it to Markdown format."
-     if schema:
-         # This is an example instruction for JSON output
-         instruction = "Extract the specified information from a list of news threads and present it in a structured JSON format."
-         prompt = f"{instruction}\n```html\n{text}\n```\nThe JSON schema is as follows:```json{schema}```"
-     else:
-         prompt = f"{instruction}\n```html\n{text}\n```"
-
-     messages = [
-         {
-             "role": "user",
-             "content": prompt,
-         }
-     ]
-
-     return tokenizer.apply_chat_template(
-         messages, tokenize=False, add_generation_prompt=True
-     )
- ```
+ ```python
+ def create_prompt(text: str, instruction: str = None, schema: str = None) -> str:
+     """
+     Create a prompt for the model with optional instruction and JSON schema.
+     Uses the `tokenizer` loaded in step 2 to apply the chat template.
+     """
+     if not instruction:
+         instruction = "Extract the main content from the given HTML and convert it to Markdown format."
+     if schema:
+         instruction = "Extract the specified information from a list of news threads and present it in a structured JSON format."
+         prompt = f"{instruction}\n```html\n{text}\n```\nThe JSON schema is as follows:```json{schema}```"
+     else:
+         prompt = f"{instruction}\n```html\n{text}\n```"
+
+     messages = [
+         {
+             "role": "user",
+             "content": prompt,
+         }
+     ]
+
+     return tokenizer.apply_chat_template(
+         messages, tokenize=False, add_generation_prompt=True
+     )
+ ```

  ### HTML to Markdown Example
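Only the tail of this example is visible in the next hunk. A sketch of the full snippet, reconstructed from the visible `generate` call and the removed instruction-focused example that follows; the `html` value here is a placeholder, not from the README:

```python
# Reconstructed sketch; mirrors the removed instruction-focused example below.
html = "<html><body><h1>Example</h1><p>Hello world</p></body></html>"  # placeholder

input_prompt = create_prompt(html)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)

print(tokenizer.decode(outputs[0]))
```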
 
@@ -158,17 +159,6 @@ outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=F
  print(tokenizer.decode(outputs[0]))
  ```

- ### Instruction-Focused Extraction
-
- ```python
- instruction = "Extract the menu items from the given HTML and convert it to Markdown format."
- input_prompt = create_prompt(html, instruction=instruction)
- inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
- outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
-
- print(tokenizer.decode(outputs[0]))
- ```
-
  ### HTML to JSON Example

  ```python
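# The body of this example falls between hunks and is not shown in the diff.
# The lines below are an assumed reconstruction; the schema is illustrative,
# not the README's actual HackerNews schema.
schema = """
{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "url": {"type": "string"}
  }
}
"""

input_prompt = create_prompt(html, schema=schema)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)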
@@ -202,6 +192,44 @@ outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=F
  print(tokenizer.decode(outputs[0]))
  ```

- ## AWS Sagemaker & Azure Marketplace & Google Cloud Platform
-
- Coming soon.
+ ## Model Performance
+
+ ReaderLM-v2 has been extensively evaluated on various tasks:
+
+ ### Quantitative Evaluation
+
+ For HTML-to-Markdown tasks, the model outperforms much larger models like Qwen2.5-32B-Instruct and Gemini2-flash-expr, achieving:
+ - ROUGE-L: 0.84
+ - Levenshtein Distance: 0.22
+ - Jaro-Winkler Similarity: 0.82
+
+ For HTML-to-JSON tasks, it shows competitive performance with:
+ - F1 Score: 0.81
+ - Precision: 0.82
+ - Recall: 0.81
+ - Pass-Rate: 0.98
+
+ ### Qualitative Evaluation
+
+ The model excels in three key dimensions:
+ - Content Integrity: 39/50
+ - Structural Accuracy: 35/50
+ - Format Compliance: 36/50
+
+ These scores demonstrate strong performance in preserving semantic information, maintaining structural accuracy, and adhering to markdown syntax standards.
+
+ ## Training Details
+
+ ReaderLM-v2 is built on Qwen2.5-1.5B-Instruct and trained using a sophisticated pipeline:
+
+ 1. Data Preparation: Created the html-markdown-1m dataset of one million HTML documents
+ 2. Synthetic Data Generation: A three-step pipeline using Qwen2.5-32B-Instruct
+    - Drafting: Initial markdown and JSON generation
+    - Refinement: Content cleanup and structure alignment
+    - Critique: Quality evaluation and filtering
+ 3. Training Process:
+    - Long-context pretraining
+    - Supervised fine-tuning
+    - Direct preference optimization
+    - Self-play reinforcement tuning
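For context on the string-similarity numbers added above, here is a sketch of how such metrics are typically computed; it assumes the third-party `rouge-score` and `jellyfish` packages and is not Jina's actual evaluation harness:

```python
from rouge_score import rouge_scorer
import jellyfish

def markdown_scores(prediction: str, reference: str) -> dict:
    # ROUGE-L F-measure between generated and reference markdown.
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, prediction)["rougeL"].fmeasure
    # Levenshtein distance normalized by the longer string (lower is better).
    lev = jellyfish.levenshtein_distance(prediction, reference) / max(len(prediction), len(reference), 1)
    # Jaro-Winkler similarity in [0, 1] (higher is better).
    jw = jellyfish.jaro_winkler_similarity(prediction, reference)
    return {"rouge_l": rouge_l, "levenshtein": lev, "jaro_winkler": jw}
```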
 
 
 
 
 