numb3r3 committed · commit a6c0d30 (verified) · parent: 6e1f1ac

Update README.md

Files changed (1): README.md (+191, -1)
<b>Trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>

[Blog](https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-markdown-and-json) | [Colab](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing)

# Intro

Jina `ReaderLM-v2` is the second generation of Jina ReaderLM, a **1.5B**-parameter language model that converts raw HTML into beautifully formatted Markdown or JSON, with superior accuracy and improved long-context handling.

`ReaderLM-v2` features several significant improvements:

- **Better Markdown Generation**: `ReaderLM-v2` generates Markdown with improved formatting and readability.
- **JSON Output**: `ReaderLM-v2` can output JSON directly, which is useful for downstream processing.
- **Longer Context Handling**: `ReaderLM-v2` handles up to 512K tokens of combined input and output length.
- **Multilingual Support**: `ReaderLM-v2` supports 29 languages, including English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more.

# Get Started

## On Google Colab

The easiest way to experience `ReaderLM-v2` is by running [our Colab notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing), which demonstrates HTML-to-Markdown conversion, JSON extraction, and instruction-following using the HackerNews frontpage as an example. The notebook is optimized for Colab's free T4 GPU tier and requires `vllm` and `triton` for acceleration. Feel free to test it with any website. For HTML-to-Markdown tasks, simply input the raw HTML without any prefix instructions. However, JSON output and instruction-based extraction require the specific prompt formatting shown in the examples.

## Local

To use this model, you need to install `transformers`:

```bash
pip install transformers
```

### HTML to Markdown Conversion

Then, you can use the model to convert HTML to Markdown as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import re

# Patterns for stripping noisy HTML before feeding it to the model.
SCRIPT_PATTERN = r'<[ ]*script.*?\/[ ]*script[ ]*>'  # match <script>...</script> and variations
STYLE_PATTERN = r'<[ ]*style.*?\/[ ]*style[ ]*>'  # match <style>...</style> and variations
META_PATTERN = r'<[ ]*meta.*?>'  # match <meta ...> tags and variations
COMMENT_PATTERN = r'<[ ]*!--.*?--[ ]*>'  # match HTML comments <!-- ... -->
LINK_PATTERN = r'<[ ]*link.*?>'  # match <link ...> tags and variations
BASE64_IMG_PATTERN = r'<img[^>]+src="data:image/[^;]+;base64,[^"]+"[^>]*>'  # base64-embedded images
SVG_PATTERN = r'(<svg[^>]*>)(.*?)(<\/svg>)'  # inline <svg>...</svg> blocks


def replace_svg(html: str, new_content: str = "this is a placeholder") -> str:
    return re.sub(
        SVG_PATTERN,
        lambda match: f"{match.group(1)}{new_content}{match.group(3)}",
        html,
        flags=re.DOTALL,
    )


def replace_base64_images(html: str, new_image_src: str = "#") -> str:
    return re.sub(BASE64_IMG_PATTERN, f'<img src="{new_image_src}"/>', html)


def clean_html(html: str, clean_svg: bool = False, clean_base64: bool = False) -> str:
    html = re.sub(SCRIPT_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(STYLE_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(META_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(COMMENT_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(LINK_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

    if clean_svg:
        html = replace_svg(html)

    if clean_base64:
        html = replace_base64_images(html)

    return html


device = "cuda"  # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
model = AutoModelForCausalLM.from_pretrained("jinaai/ReaderLM-v2").to(device)


def create_prompt(text: str, instruction: str = None, schema: str = None) -> str:
    """
    Create a prompt for the model with an optional instruction and JSON schema.

    Args:
        text (str): The input HTML text
        instruction (str, optional): Custom instruction for the model
        schema (str, optional): JSON schema for structured extraction

    Returns:
        str: The formatted prompt
    """
    if not instruction:
        instruction = "Extract the main content from the given HTML and convert it to Markdown format."

    if schema:
        instruction = "Extract the specified information from a list of news threads and present it in a structured JSON format."
        prompt = f"{instruction}\n```html\n{text}\n```\nThe JSON schema is as follows:```json{schema}```"
    else:
        prompt = f"{instruction}\n```html\n{text}\n```"

    messages = [{"role": "user", "content": prompt}]

    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )


# example html content
html = "<html><body><h1>Hello, world!</h1></body></html>"

# clean the html content, removing scripts, styles, comments, etc.
html = clean_html(html)

input_prompt = create_prompt(html)
print(input_prompt)

inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, repetition_penalty=1.08)

print(tokenizer.decode(outputs[0]))
```

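One detail worth noting when decoding: with decoder-only generation in `transformers`, `outputs[0]` contains the prompt tokens followed by the continuation, so `tokenizer.decode(outputs[0])` echoes the prompt back. A minimal sketch (plain lists with made-up token IDs, not the real vocabulary) of slicing off the prompt before decoding:

```python
def extract_generated(output_ids, prompt_len):
    # model.generate returns prompt + continuation; keep only the new tokens.
    return output_ids[prompt_len:]


prompt_ids = [101, 42, 7]  # made-up token IDs for illustration
output_ids = [101, 42, 7, 900, 901, 2]
print(extract_generated(output_ids, len(prompt_ids)))  # [900, 901, 2]
```

With the tensors above, the equivalent is `tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)`.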

You can also specify the content you want to extract from the HTML by providing a custom instruction. For example, to extract the menu items from the HTML content, you can create a prompt like this:

```python
instruction = "Extract the menu items from the given HTML and convert it to Markdown format."
input_prompt = create_prompt(html, instruction=instruction)

inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, repetition_penalty=1.08)

print(tokenizer.decode(outputs[0]))
```
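For clarity, before the chat template is applied, the prompt that `create_prompt` assembles in this instruction case is simply the instruction followed by the HTML in a fenced block. A self-contained sketch of that layout:

```python
def build_prompt(text: str, instruction: str) -> str:
    # Mirrors the non-schema prompt layout used by create_prompt above,
    # before tokenizer.apply_chat_template wraps it in the chat format.
    return f"{instruction}\n```html\n{text}\n```"


print(build_prompt("<ul><li>Espresso</li></ul>",
                   "Extract the menu items from the given HTML and convert it to Markdown format."))
```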

### HTML to JSON Conversion

To extract structured information from HTML content and convert it to JSON, you can create a prompt with a JSON schema:

```python
schema = """
{
  "type": "object",
  "properties": {
    "title": {
      "type": "string"
    },
    "author": {
      "type": "string"
    },
    "date": {
      "type": "string"
    },
    "content": {
      "type": "string"
    }
  },
  "required": ["title", "author", "date", "content"]
}
"""

input_prompt = create_prompt(html, schema=schema)

inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, repetition_penalty=1.08)

print(tokenizer.decode(outputs[0]))
```
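The decoded output is plain text; the extracted JSON commonly arrives wrapped in a fenced block, though the exact wrapping is not guaranteed. The helper below is a small sketch (an assumption about the output format, not part of any official API) that tolerates both a fenced block and a bare JSON string:

```python
import json
import re


def extract_json(model_output: str) -> dict:
    # Prefer a fenced json block if present; fall back to parsing the raw string.
    match = re.search(r"```json\s*(.*?)\s*```", model_output, re.DOTALL)
    payload = match.group(1) if match else model_output
    return json.loads(payload)


demo = '```json\n{"title": "Hello, world!", "author": "unknown"}\n```'
print(extract_json(demo)["title"])  # Hello, world!
```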

## AWS SageMaker & Azure Marketplace

TBD