<b>Trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>
[Blog](https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-markdown-and-json) | [Colab](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing) | [AWS](https://aws.amazon.com/marketplace/pp/prodview-jwfct4j4rvxk2?sr=0-21&ref_=beagle&applicationId=AWSMPContessa)
# ReaderLM-v2
`ReaderLM-v2` is a 1.5B-parameter language model that converts raw HTML into beautifully formatted markdown or JSON with superior accuracy and improved long-context handling. Supporting 29 languages, `ReaderLM-v2` is specialized for tasks involving HTML parsing, transformation, and text extraction.
## Model Overview
- **Model Type**: Autoregressive, decoder-only transformer
- **Parameter Count**: 1.54B
- **Context Window**: Up to 512K tokens (combined input and output)
- **Hidden Size**: 1536
- **Number of Layers**: 28
- **Query Heads**: 12
- **KV Heads**: 2
- **Head Size**: 128
- **Intermediate Size**: 8960
- **Supported Languages**: English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more (29 total)
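
These architecture details can be read directly from the published configuration. A minimal sketch for confirming them, assuming `transformers` is installed and the Hugging Face Hub is reachable:

```python
from transformers import AutoConfig

# Fetch the model configuration from the Hugging Face Hub
config = AutoConfig.from_pretrained("jinaai/ReaderLM-v2")

print(config.hidden_size)           # hidden size
print(config.num_hidden_layers)     # number of layers
print(config.num_attention_heads)   # query heads
print(config.num_key_value_heads)   # KV heads
print(config.intermediate_size)     # intermediate size
```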
## What's New in `ReaderLM-v2`
`ReaderLM-v2` represents a significant leap forward from its predecessor, with several key improvements:
- **Better Markdown Generation**: Thanks to its new training paradigm and higher-quality training data, the model excels at generating complex elements like code fences, nested lists, tables, and LaTeX equations.
- **JSON Output**: Introduces direct HTML-to-JSON generation using predefined schemas, eliminating the need for intermediate markdown conversion.
- **Longer Context Handling**: Handles up to 512K tokens of combined input and output, with improved performance on long-form content.
- **Multilingual Support**: Comprehensive support across 29 languages for broader applications.
- **Enhanced Stability**: Contrastive loss during training greatly alleviates degeneration issues when generating long sequences.
---
## On Google Colab
The easiest way to experience `ReaderLM-v2` is through our [Colab notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing), which demonstrates HTML-to-markdown conversion, JSON extraction, and instruction-following using the HackerNews frontpage as an example. The notebook is optimized for Colab's free T4 GPU tier and requires `vllm` and `triton` for acceleration.
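
If you want to reproduce the notebook's setup outside Colab, a minimal `vllm` sketch looks roughly like this; the prompt and sampling values below are illustrative assumptions, not the notebook's exact configuration:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="jinaai/ReaderLM-v2")

# A bare HTML-to-markdown prompt; for best results build it with the
# chat template (see the create_prompt helper under Local Usage)
input_prompt = "Extract the main content from the given HTML and convert it to Markdown format.\n<html><body><h1>Hello</h1></body></html>"

params = SamplingParams(temperature=0, repetition_penalty=1.08, max_tokens=1024)
outputs = llm.generate([input_prompt], params)
print(outputs[0].outputs[0].text)
```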
Note that the free T4 GPU has limitations: it doesn't support bfloat16 or flash attention 2, which leads to higher memory usage and slower processing of longer inputs. Nevertheless, ReaderLM-v2 successfully processes large documents under these constraints, achieving processing speeds of 67 tokens/s on input and 36 tokens/s on output. For production use, we recommend an RTX 3090/4090 for optimal performance.
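
On GPUs that do support these features, you can opt in when loading the model. A sketch, assuming a recent `transformers` and the `flash-attn` package installed:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "jinaai/ReaderLM-v2",
    torch_dtype=torch.bfloat16,               # halves memory vs. float32
    attn_implementation="flash_attention_2",  # requires the flash-attn package
).to("cuda")
```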
## Local Usage

To use `ReaderLM-v2` locally:

1. Install the required dependencies: `pip install transformers`

2. Load the model and tokenizer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # use "cpu" if no GPU is available
tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
model = AutoModelForCausalLM.from_pretrained("jinaai/ReaderLM-v2").to(device)
```
3. (Optional) Pre-clean your HTML to remove scripts, styles, and comments, reducing the noise and length of the input:
```python
import re

# Patterns
SCRIPT_PATTERN = r"<[ ]*script.*?\/[ ]*script[ ]*>"
STYLE_PATTERN = r"<[ ]*style.*?\/[ ]*style[ ]*>"
META_PATTERN = r"<[ ]*meta.*?>"
COMMENT_PATTERN = r"<[ ]*!--.*?--[ ]*>"
LINK_PATTERN = r"<[ ]*link.*?>"

def clean_html(html: str) -> str:
    """Remove scripts, styles, meta tags, comments, and link tags from raw HTML."""
    flags = re.IGNORECASE | re.MULTILINE | re.DOTALL
    for pattern in (SCRIPT_PATTERN, STYLE_PATTERN, META_PATTERN, COMMENT_PATTERN, LINK_PATTERN):
        html = re.sub(pattern, "", html, flags=flags)
    return html
```

4. Create a prompt for the model:
```python
def create_prompt(text: str, tokenizer=None, instruction: str = None, schema: str = None) -> str:
    """
    Create a prompt for the model with optional instruction and JSON schema.
    """
    if not instruction:
        instruction = "Extract the main content from the given HTML and convert it to Markdown format."
    if schema:
        instruction = "Extract the specified information from a list of news threads and present it in a structured JSON format."
        prompt = f"{instruction}\n```html\n{text}\n```\nThe JSON schema is as follows:```json{schema}```"
    else:
        prompt = f"{instruction}\n```html\n{text}\n```"

    messages = [
        {
            "role": "user",
            "content": prompt,
        }
    ]

    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```
### HTML to Markdown Example

```python
# Sample input; replace with any raw (or pre-cleaned) HTML
html = "<html><body><h1>Hello, world!</h1><p>ReaderLM-v2 converts HTML to markdown.</p></body></html>"

input_prompt = create_prompt(html, tokenizer=tokenizer)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)

print(tokenizer.decode(outputs[0]))
```
### HTML to JSON Example
```python
# Example schema; adapt the fields to whatever you want to extract
schema = """
{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "author": {"type": "string"},
    "date": {"type": "string"},
    "content": {"type": "string"}
  },
  "required": ["title", "content"]
}
"""

input_prompt = create_prompt(html, tokenizer=tokenizer, schema=schema)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)

print(tokenizer.decode(outputs[0]))
```
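
The decoded output wraps the generated JSON in chat-template markup. A small helper like the following (an illustrative addition, not part of the official examples) pulls out and parses the JSON payload:

```python
import json
import re

def extract_json(decoded_output: str) -> dict:
    # Grab the outermost {...} span in the decoded text and parse it
    match = re.search(r"\{.*\}", decoded_output, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

data = extract_json(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(data)
```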
## Model Performance
ReaderLM-v2 has been extensively evaluated on various tasks:
### Quantitative Evaluation
For HTML-to-Markdown tasks, the model outperforms much larger models like Qwen2.5-32B-Instruct and Gemini2-flash-expr, achieving:
- ROUGE-L: 0.84
- Levenshtein Distance: 0.22
- Jaro-Winkler Similarity: 0.82
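
For reference, these string-similarity metrics can be computed along the following lines. This is a sketch assuming the `rouge-score` and `Levenshtein` packages, and assuming the reported Levenshtein distance is normalized to [0, 1]:

```python
import Levenshtein
from rouge_score import rouge_scorer

def markdown_scores(prediction: str, reference: str) -> dict:
    # ROUGE-L F-measure between reference and prediction
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure

    # Normalized edit distance: 0.0 means identical strings
    lev = Levenshtein.distance(prediction, reference) / max(len(prediction), len(reference), 1)
    jw = Levenshtein.jaro_winkler(prediction, reference)
    return {"rouge_l": rouge_l, "levenshtein": lev, "jaro_winkler": jw}
```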
For HTML-to-JSON tasks, it shows competitive performance with:
- F1 Score: 0.81
- Precision: 0.82
- Recall: 0.81
- Pass-Rate: 0.98
### Qualitative Evaluation
The model excels in three key dimensions:
- Content Integrity: 39/50
- Structural Accuracy: 35/50
- Format Compliance: 36/50
These scores demonstrate strong performance in preserving semantic information, maintaining structural accuracy, and adhering to markdown syntax standards.
## Training Details
ReaderLM-v2 is built on Qwen2.5-1.5B-Instruct and trained with a multi-stage pipeline:
1. Data Preparation: Created the html-markdown-1m dataset of 1 million HTML documents
2. Synthetic Data Generation: Three-step pipeline using Qwen2.5-32B-Instruct
   - Drafting: Initial markdown and JSON generation
   - Refinement: Content cleanup and structure alignment
   - Critique: Quality evaluation and filtering

3. Training Process:
   - Long-context pretraining
   - Supervised fine-tuning
   - Direct preference optimization
   - Self-play reinforcement tuning