<b>Trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>
[Blog](https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-markdown-and-json) | [Colab](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing) | [AWS](https://aws.amazon.com/marketplace/pp/prodview-jwfct4j4rvxk2?sr=0-21&ref_=beagle&applicationId=AWSMPContessa)
# ReaderLM-v2
`ReaderLM-v2` is a 1.5B-parameter language model that converts raw HTML into beautifully formatted markdown or JSON with superior accuracy and improved long-context handling. Supporting 29 languages, `ReaderLM-v2` is specialized for tasks involving HTML parsing, transformation, and text extraction.
## Model Overview
- **Model Type**: Autoregressive, decoder-only transformer
- **Parameter Count**: 1.54B
- **Context Window**: Up to 512K tokens (combined input and output)
- **Hidden Size**: 1536
- **Number of Layers**: 28
- **Query Heads**: 12
- **KV Heads**: 2
- **Head Size**: 128
- **Intermediate Size**: 8960
- **Supported Languages**: English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more (29 total)
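
These architecture details can be read directly from the published configuration. A minimal sketch for confirming them, assuming `transformers` is installed and the Hugging Face Hub is reachable:

```python
from transformers import AutoConfig

# Fetch the model configuration from the Hugging Face Hub
config = AutoConfig.from_pretrained("jinaai/ReaderLM-v2")

print(config.hidden_size)           # hidden size
print(config.num_hidden_layers)     # number of layers
print(config.num_attention_heads)   # query heads
print(config.num_key_value_heads)   # KV heads
print(config.intermediate_size)     # intermediate size
```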
## What's New in `ReaderLM-v2`
`ReaderLM-v2` represents a significant leap forward from its predecessor, with several key improvements:
- **Better Markdown Generation**: Thanks to its new training paradigm and higher-quality training data, the model excels at generating complex elements like code fences, nested lists, tables, and LaTeX equations.
- **JSON Output**: Introduces direct HTML-to-JSON generation using predefined schemas, eliminating the need for intermediate markdown conversion.
- **Longer Context Handling**: Handles up to 512K tokens of combined input and output, with improved performance on long-form content.
- **Multilingual Support**: Comprehensive support across 29 languages for broader applications.
- **Enhanced Stability**: Contrastive loss during training greatly alleviates degeneration issues when generating long sequences.
---
## On Google Colab
The easiest way to experience `ReaderLM-v2` is through our [Colab notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing), which demonstrates HTML-to-markdown conversion, JSON extraction, and instruction-following using the HackerNews frontpage as an example. The notebook is optimized for Colab's free T4 GPU tier and requires `vllm` and `triton` for acceleration.
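
If you want to reproduce the notebook's setup outside Colab, a minimal `vllm` sketch looks roughly like this; the prompt and sampling values below are illustrative assumptions, not the notebook's exact configuration:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="jinaai/ReaderLM-v2")

# A bare HTML-to-markdown prompt; for best results build it with the
# chat template (see the create_prompt helper under Local Usage)
input_prompt = "Extract the main content from the given HTML and convert it to Markdown format.\n<html><body><h1>Hello</h1></body></html>"

params = SamplingParams(temperature=0, repetition_penalty=1.08, max_tokens=1024)
outputs = llm.generate([input_prompt], params)
print(outputs[0].outputs[0].text)
```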
Note that the free T4 GPU has limitations: it doesn't support bfloat16 or flash attention 2, which leads to higher memory usage and slower processing of longer inputs. Nevertheless, ReaderLM-v2 successfully processes large documents under these constraints, achieving processing speeds of 67 tokens/s on input and 36 tokens/s on output. For production use, we recommend an RTX 3090/4090 for optimal performance.
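
On GPUs that do support these features, you can opt in when loading the model. A sketch, assuming a recent `transformers` and the `flash-attn` package installed:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "jinaai/ReaderLM-v2",
    torch_dtype=torch.bfloat16,               # halves memory vs. float32
    attn_implementation="flash_attention_2",  # requires the flash-attn package
).to("cuda")
```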
## Local Usage

To use `ReaderLM-v2` locally:

1. Install the required dependencies: `pip install transformers`

2. Load the model and tokenizer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # use "cpu" if no GPU is available
tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
model = AutoModelForCausalLM.from_pretrained("jinaai/ReaderLM-v2").to(device)
```
3. (Optional) Pre-clean your HTML to remove scripts, styles, and comments, reducing the noise and length of the input:
```python
import re

# Patterns
SCRIPT_PATTERN = r"<[ ]*script.*?\/[ ]*script[ ]*>"
STYLE_PATTERN = r"<[ ]*style.*?\/[ ]*style[ ]*>"
META_PATTERN = r"<[ ]*meta.*?>"
COMMENT_PATTERN = r"<[ ]*!--.*?--[ ]*>"
LINK_PATTERN = r"<[ ]*link.*?>"

def clean_html(html: str) -> str:
    """Remove scripts, styles, meta tags, comments, and link tags from raw HTML."""
    flags = re.IGNORECASE | re.MULTILINE | re.DOTALL
    for pattern in (SCRIPT_PATTERN, STYLE_PATTERN, META_PATTERN, COMMENT_PATTERN, LINK_PATTERN):
        html = re.sub(pattern, "", html, flags=flags)
    return html
```

4. Create a prompt for the model:
```python
def create_prompt(text: str, tokenizer=None, instruction: str = None, schema: str = None) -> str:
    """
    Create a prompt for the model with optional instruction and JSON schema.
    """
    if not instruction:
        instruction = "Extract the main content from the given HTML and convert it to Markdown format."
    if schema:
        instruction = "Extract the specified information from a list of news threads and present it in a structured JSON format."
        prompt = f"{instruction}\n```html\n{text}\n```\nThe JSON schema is as follows:```json{schema}```"
    else:
        prompt = f"{instruction}\n```html\n{text}\n```"

    messages = [
        {
            "role": "user",
            "content": prompt,
        }
    ]

    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```
### HTML to Markdown Example

```python
# Sample input; replace with any raw (or pre-cleaned) HTML
html = "<html><body><h1>Hello, world!</h1><p>ReaderLM-v2 converts HTML to markdown.</p></body></html>"

input_prompt = create_prompt(html, tokenizer=tokenizer)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)

print(tokenizer.decode(outputs[0]))
```
### HTML to JSON Example
```python
# Example schema; adapt the fields to whatever you want to extract
schema = """
{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "author": {"type": "string"},
    "date": {"type": "string"},
    "content": {"type": "string"}
  },
  "required": ["title", "content"]
}
"""

input_prompt = create_prompt(html, tokenizer=tokenizer, schema=schema)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)

print(tokenizer.decode(outputs[0]))
```
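
The decoded output wraps the generated JSON in chat-template markup. A small helper like the following (an illustrative addition, not part of the official examples) pulls out and parses the JSON payload:

```python
import json
import re

def extract_json(decoded_output: str) -> dict:
    # Grab the outermost {...} span in the decoded text and parse it
    match = re.search(r"\{.*\}", decoded_output, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

data = extract_json(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(data)
```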
## Model Performance
ReaderLM-v2 has been extensively evaluated on various tasks:
### Quantitative Evaluation
For HTML-to-Markdown tasks, the model outperforms much larger models like Qwen2.5-32B-Instruct and Gemini2-flash-expr, achieving:
- ROUGE-L: 0.84
- Levenshtein Distance: 0.22
- Jaro-Winkler Similarity: 0.82
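
For reference, these string-similarity metrics can be computed along the following lines. This is a sketch assuming the `rouge-score` and `Levenshtein` packages, and assuming the reported Levenshtein distance is normalized to [0, 1]:

```python
import Levenshtein
from rouge_score import rouge_scorer

def markdown_scores(prediction: str, reference: str) -> dict:
    # ROUGE-L F-measure between reference and prediction
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure

    # Normalized edit distance: 0.0 means identical strings
    lev = Levenshtein.distance(prediction, reference) / max(len(prediction), len(reference), 1)
    jw = Levenshtein.jaro_winkler(prediction, reference)
    return {"rouge_l": rouge_l, "levenshtein": lev, "jaro_winkler": jw}
```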
For HTML-to-JSON tasks, it shows competitive performance with:
- F1 Score: 0.81
- Precision: 0.82
- Recall: 0.81
- Pass-Rate: 0.98
### Qualitative Evaluation
The model excels in three key dimensions:
- Content Integrity: 39/50
- Structural Accuracy: 35/50
- Format Compliance: 36/50
These scores demonstrate strong performance in preserving semantic information, maintaining structural accuracy, and adhering to markdown syntax standards.
## Training Details
ReaderLM-v2 is built on Qwen2.5-1.5B-Instruct and trained with a multi-stage pipeline:
1. Data Preparation: Created the html-markdown-1m dataset of 1 million HTML documents
2. Synthetic Data Generation: Three-step pipeline using Qwen2.5-32B-Instruct
   - Drafting: Initial markdown and JSON generation
   - Refinement: Content cleanup and structure alignment
   - Critique: Quality evaluation and filtering

3. Training Process:
   - Long-context pretraining
   - Supervised fine-tuning
   - Direct preference optimization
   - Self-play reinforcement tuning