numb3r3 committed · commit a6c0d30 (verified) · parent: 6e1f1ac

Update README.md

Files changed (1): README.md (+191, -1)
<b>Trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>

[Blog](https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-markdown-and-json) | [Colab](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing)

# Intro

Jina `ReaderLM-v2` is the second generation of Jina ReaderLM, a **1.5B**-parameter language model that converts raw HTML into beautifully formatted Markdown or JSON, with superior accuracy and improved long-context handling.

`ReaderLM-v2` features several significant improvements:

- **Better Markdown Generation**: `ReaderLM-v2` generates Markdown with improved formatting and readability.
- **JSON Output**: `ReaderLM-v2` can output JSON directly, which is useful for downstream processing.
- **Longer Context Handling**: `ReaderLM-v2` handles up to 512K tokens of combined input and output length.
- **Multilingual Support**: `ReaderLM-v2` supports 29 languages, including English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more.

# Get Started

## On Google Colab

The easiest way to experience `ReaderLM-v2` is by running [our Colab notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing), which demonstrates HTML-to-Markdown conversion, JSON extraction, and instruction-following using the HackerNews frontpage as an example. The notebook is optimized for Colab's free T4 GPU tier and requires `vllm` and `triton` for acceleration. Feel free to test it with any website. For HTML-to-Markdown tasks, simply input the raw HTML without any prefix instructions. However, JSON output and instruction-based extraction require the specific prompt formatting shown in the examples.

## Local

To use this model, you need to install `transformers`:

```bash
pip install transformers
```

### HTML to Markdown Conversion

Then, you can use the model to convert HTML to Markdown as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import re

# Patterns for stripping noisy HTML before feeding it to the model.
SCRIPT_PATTERN = r'<[ ]*script.*?\/[ ]*script[ ]*>'  # match <script>...</script> and variations
STYLE_PATTERN = r'<[ ]*style.*?\/[ ]*style[ ]*>'  # match <style>...</style> and variations
META_PATTERN = r'<[ ]*meta.*?>'  # match <meta ...> tags and variations
COMMENT_PATTERN = r'<[ ]*!--.*?--[ ]*>'  # match HTML comments <!-- ... -->
LINK_PATTERN = r'<[ ]*link.*?>'  # match <link ...> tags and variations
BASE64_IMG_PATTERN = r'<img[^>]+src="data:image/[^;]+;base64,[^"]+"[^>]*>'  # base64-embedded images
SVG_PATTERN = r'(<svg[^>]*>)(.*?)(<\/svg>)'  # inline <svg>...</svg> blocks


def replace_svg(html: str, new_content: str = "this is a placeholder") -> str:
    return re.sub(
        SVG_PATTERN,
        lambda match: f"{match.group(1)}{new_content}{match.group(3)}",
        html,
        flags=re.DOTALL,
    )


def replace_base64_images(html: str, new_image_src: str = "#") -> str:
    return re.sub(BASE64_IMG_PATTERN, f'<img src="{new_image_src}"/>', html)


def clean_html(html: str, clean_svg: bool = False, clean_base64: bool = False) -> str:
    html = re.sub(SCRIPT_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(STYLE_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(META_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(COMMENT_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(LINK_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

    if clean_svg:
        html = replace_svg(html)

    if clean_base64:
        html = replace_base64_images(html)

    return html


device = "cuda"  # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
model = AutoModelForCausalLM.from_pretrained("jinaai/ReaderLM-v2").to(device)


def create_prompt(text: str, instruction: str = None, schema: str = None) -> str:
    """
    Create a prompt for the model with an optional instruction and JSON schema.

    Args:
        text (str): The input HTML text
        instruction (str, optional): Custom instruction for the model
        schema (str, optional): JSON schema for structured extraction

    Returns:
        str: The formatted prompt
    """
    if not instruction:
        instruction = "Extract the main content from the given HTML and convert it to Markdown format."

    if schema:
        instruction = "Extract the specified information from a list of news threads and present it in a structured JSON format."
        prompt = f"{instruction}\n```html\n{text}\n```\nThe JSON schema is as follows:```json{schema}```"
    else:
        prompt = f"{instruction}\n```html\n{text}\n```"

    messages = [{"role": "user", "content": prompt}]

    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )


# example html content
html = "<html><body><h1>Hello, world!</h1></body></html>"

# clean the html content, removing scripts, styles, comments, etc.
html = clean_html(html)

input_prompt = create_prompt(html)
print(input_prompt)

inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, repetition_penalty=1.08)

print(tokenizer.decode(outputs[0]))
```

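One detail worth noting when decoding: with decoder-only generation in `transformers`, `outputs[0]` contains the prompt tokens followed by the continuation, so `tokenizer.decode(outputs[0])` echoes the prompt back. A minimal sketch (plain lists with made-up token IDs, not the real vocabulary) of slicing off the prompt before decoding:

```python
def extract_generated(output_ids, prompt_len):
    # model.generate returns prompt + continuation; keep only the new tokens.
    return output_ids[prompt_len:]


prompt_ids = [101, 42, 7]  # made-up token IDs for illustration
output_ids = [101, 42, 7, 900, 901, 2]
print(extract_generated(output_ids, len(prompt_ids)))  # [900, 901, 2]
```

With the tensors above, the equivalent is `tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)`.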

You can also specify the content you want to extract from the HTML by providing a custom instruction. For example, to extract the menu items from the HTML content, you can create a prompt like this:

```python
instruction = "Extract the menu items from the given HTML and convert it to Markdown format."
input_prompt = create_prompt(html, instruction=instruction)

inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, repetition_penalty=1.08)

print(tokenizer.decode(outputs[0]))
```
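For clarity, before the chat template is applied, the prompt that `create_prompt` assembles in this instruction case is simply the instruction followed by the HTML in a fenced block. A self-contained sketch of that layout:

```python
def build_prompt(text: str, instruction: str) -> str:
    # Mirrors the non-schema prompt layout used by create_prompt above,
    # before tokenizer.apply_chat_template wraps it in the chat format.
    return f"{instruction}\n```html\n{text}\n```"


print(build_prompt("<ul><li>Espresso</li></ul>",
                   "Extract the menu items from the given HTML and convert it to Markdown format."))
```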

### HTML to JSON Conversion

To extract structured information from HTML content and convert it to JSON, you can create a prompt with a JSON schema:

```python
schema = """
{
  "type": "object",
  "properties": {
    "title": {
      "type": "string"
    },
    "author": {
      "type": "string"
    },
    "date": {
      "type": "string"
    },
    "content": {
      "type": "string"
    }
  },
  "required": ["title", "author", "date", "content"]
}
"""

input_prompt = create_prompt(html, schema=schema)

inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, repetition_penalty=1.08)

print(tokenizer.decode(outputs[0]))
```
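The decoded output is plain text; the extracted JSON commonly arrives wrapped in a fenced block, though the exact wrapping is not guaranteed. The helper below is a small sketch (an assumption about the output format, not part of any official API) that tolerates both a fenced block and a bare JSON string:

```python
import json
import re


def extract_json(model_output: str) -> dict:
    # Prefer a fenced json block if present; fall back to parsing the raw string.
    match = re.search(r"```json\s*(.*?)\s*```", model_output, re.DOTALL)
    payload = match.group(1) if match else model_output
    return json.loads(payload)


demo = '```json\n{"title": "Hello, world!", "author": "unknown"}\n```'
print(extract_json(demo)["title"])  # Hello, world!
```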

## AWS SageMaker & Azure Marketplace

TBD