Replicating DeepSeek R1 for Information Extraction

Published January 31, 2025

I have been working on replicating DeepSeek R1 since its release. My main focus has been on information extraction, particularly zero-shot text-to-graph extraction. This is a task where, given a list of entity and relation types, we extract a list of entities from a target text and the relations between them.

Example of text-to-graph output:

{
    "entities": [
        {
            "id": 0,
            "text": "Microsoft",
            "type": "company"
        },
        {
            "id": 1,
            "text": "Satya Nadella",
            "type": "person"
        },
        {
            "id": 2,
            "text": "Azure AI",
            "type": "product",
        }
    ],
    "relations": [
        {
            "head": "Satya Nadella",
            "tail": "Microsoft",
            "type": "CEO of"
        },
        {
            "head": "Microsoft",
            "tail": "Azure AI",
            "type": "developed"
        }
    ]
}

It's quite a complicated task, especially for small generative language models. Language models can do it relatively well if we don't constrain outputs by the required entity and relation types and allow the model to freely extract all entities and relations from the text. But when we condition the output on entity and relation types, it becomes a true nightmare for LMs. From my experiments, it was hard to train small language models in a supervised manner to conditionally output graphs from text based on input entity types. Reinforcement learning approaches give hope, so let's discuss them in detail.


Reinforcement learning differs from supervised learning in that we don't explicitly tell the model which actions to take to achieve desirable milestones. In our case, the milestone is a correctly extracted graph given the input entity and relation types, and the actions are the tokens the model generates. In supervised learning, by contrast, we directly tell the model how to reproduce this graph, say in JSON, by maximizing the probability of generating output in the desired format.

Many people talk about the importance of thinking as one of the main boosts that RL brings to the LLM space, and since many papers have shown that chain-of-thought reasoning improves model performance, it seems reasonable that thinking helps here too. However, I think several other properties of RL can have a significant impact as well. But first, let's discuss the GRPO approach introduced by DeepSeek; the loss function used by the team is shown below.

[Figure: the GRPO objective function as presented by the DeepSeek team.]

Not diving deep into the math, let me describe its meaning in high-level terms. Basically, we generate a set of candidate solutions to a given problem and increase the probability of generating each candidate in proportion to the reward it receives. Additionally, with the KL divergence term, we try to minimize drift from the original model we took as a starting point.
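For reference, the GRPO objective from the DeepSeek papers has roughly the following form (my transcription; notation follows the original formulation, with group-normalized advantages and a KL penalty toward the reference model):

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)} A_i,\ \mathrm{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\ 1-\varepsilon,\ 1+\varepsilon \right) A_i \right) - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right) \right) \right]
$$

where the group-relative advantage of the $i$-th candidate is $A_i = \dfrac{r_i - \mathrm{mean}(\{r_1, \dots, r_G\})}{\mathrm{std}(\{r_1, \dots, r_G\})}$.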

Such a training algorithm brings interesting properties. For example, the model is forced to generate several candidate solutions, some of which are effectively hard negatives for the given problem, since the model assigns relatively high probabilities to generating them. So in some sense, the model sees both positive and negative examples during training.

Additionally, as Andrej Karpathy pointed out, directly labelled examples cannot force the model to utilize its knowledge to infer new emergent properties, such as an "aha" moment:

“The model could never learn this with 1 (by imitation) because the cognition of the model and the cognition of the human labeller is different. Humans would never know how to correctly annotate these kinds of problem-solving strategies and what they should even look like. They have to be discovered during reinforcement learning as empirically and statistically useful towards a final outcome.”

Another interesting property of RL is that we can optimize different objectives and manually control their influence. For example, if we see that our model struggles with relation extraction, we can assign higher rewards to outputs with correctly extracted relations compared to other criteria.
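As a toy illustration of this knob (the reward names and coefficients below are made up for the example, not the exact values used in training), a weighted combination of reward signals could look like this:

# Hypothetical sketch: combining several reward signals with manual weights.
# The component names and coefficients are illustrative only.
WEIGHTS = {"format": 0.5, "json": 1.0, "f1": 2.0}  # raise "f1" to prioritize correct relations

def combined_reward(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of individual reward components."""
    return sum(weights[name] * value for name, value in scores.items())

# Example: a completion with a valid format and valid JSON but weak relation extraction.
print(combined_reward({"format": 1.0, "json": 1.0, "f1": 0.4}))  # 0.5 + 1.0 + 0.8 = 2.3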


So, let's discuss what exactly we did to train a model with GRPO for text-to-graph extraction; the diagram below shows the overall pipeline:

[Figure: the training pipeline — synthetic data generation, supervised training, and GRPO-based reinforcement learning.]

The training process consists of three major stages: synthetic data generation, supervised training, and reinforcement learning (RL) training. Each of these stages plays a crucial role in improving the model’s ability to perform structured information extraction.

  1. Synthetic Data Generation

To bootstrap the process, we start with data collection, gathering diverse text sources relevant to our target domain. The text-to-graph generation step, powered by structured generation with Llama 70B, converts unstructured text into graph-based representations. However, this step is imperfect, so selecting and augmenting the data becomes essential to filter out low-quality extractions and enrich the dataset with more diverse structures.
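The post does not spell out the filtering criteria; as a minimal sketch, assuming the JSON schema shown earlier, a sanity filter could simply drop samples whose output does not parse or whose entity spans never occur in the source text:

import json

def keep_sample(text: str, raw_output: str) -> bool:
    """Hypothetical filter for synthetic text-to-graph samples: reject outputs
    that are not valid JSON, lack the expected keys, or contain entity spans
    that do not occur in the source text."""
    try:
        graph = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(graph, dict) or "entities" not in graph or "relations" not in graph:
        return False
    try:
        entity_texts = {e["text"] for e in graph["entities"]}
        # Relations must only refer to entities that were actually extracted.
        relations_ok = all(r["head"] in entity_texts and r["tail"] in entity_texts
                           for r in graph["relations"])
    except (TypeError, KeyError):
        return False
    if not entity_texts or any(span not in text for span in entity_texts):
        return False
    return relations_ok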

Additionally, we take the JSON data produced by structured prediction and feed it, together with the source text, into DeepSeek-R1 Llama 70B to generate a chain of thought that explains the extraction process.
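The exact prompt is not given in the post; a rough, entirely hypothetical sketch of such a request might look like this:

# Hypothetical prompt template for asking a reasoning model to explain an
# already-extracted graph; the wording is illustrative, not the one used in the project.
COT_PROMPT = (
    "Given the text and the extracted graph below, reason step by step about how the "
    "entities and relations can be identified, then restate the final JSON.\n\n"
    "Text:\n{text}\n\nGraph:\n{graph_json}"
)

def build_cot_request(text: str, graph_json: str) -> str:
    return COT_PROMPT.format(text=text, graph_json=graph_json)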

We experimented with both thinking-enabled and thinking-disabled modes and found that small models struggle to discover some interesting and important thinking strategies on their own.

  2. Supervised Training

Before starting reinforcement learning, and considering that we use small models, additional supervised training is required to nudge the model toward returning data in the right format. We used only 1k examples for this purpose.
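The post credits the TRL project, so a supervised warm-up along these lines seems plausible; the dataset file, base model, and hyperparameters below are placeholders rather than the actual recipe:

# Sketch of the supervised warm-up stage (~1k examples) using TRL's SFTTrainer.
# File names, the base model, and hyperparameters are assumptions for illustration.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical JSONL file with a "text" column holding prompt + target JSON (and optional reasoning).
dataset = load_dataset("json", data_files="sft_text2graph.jsonl", split="train")

training_args = SFTConfig(
    output_dir="qwen2.5-0.5b-text2graph-sft",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    max_seq_length=1024,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # a small base model, as in the released checkpoint
    args=training_args,
    train_dataset=dataset,
)
trainer.train()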

  3. Reinforcement Learning with GRPO

Supervised training alone does not fully solve the problem, especially when it comes to conditioning model outputs on predefined entity and relation types. To address this, we employ Group Relative Policy Optimization (GRPO) for reinforcement learning.

  • Format reward ensures that the output follows a structured format, with the reasoning wrapped in the corresponding tag (in thinking mode).
  • JSON reward validates that the output is well-formed, machine-readable JSON and that its structure matches the desired format.
  • F1 reward evaluates the accuracy of extracted entities and relations by comparing them to ground-truth graphs (rough sketches of these three functions follow below).
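Here is a minimal sketch of what these three rewards could look like, written in the "list of completions in, list of scores out" style used by TRL's GRPOTrainer; the tag names, the ground_truths column, and the matching logic are assumptions, not the project's exact code:

import json
import re

# Assumes completions arrive as plain strings; one score is returned per completion.

def format_reward(completions, **kwargs):
    """1.0 if the completion wraps its reasoning in <think>...</think> and then
    produces an answer (thinking mode), else 0.0."""
    pattern = re.compile(r"<think>.*?</think>\s*\S", re.DOTALL)
    return [1.0 if pattern.search(c) else 0.0 for c in completions]

def _extract_json(completion):
    """Take the text after the closing think tag (if present) and try to parse it."""
    answer = completion.split("</think>")[-1].strip()
    try:
        return json.loads(answer)
    except json.JSONDecodeError:
        return None

def json_reward(completions, **kwargs):
    """1.0 for well-formed JSON with the expected top-level structure."""
    rewards = []
    for c in completions:
        graph = _extract_json(c)
        ok = isinstance(graph, dict) and "entities" in graph and "relations" in graph
        rewards.append(1.0 if ok else 0.0)
    return rewards

def f1_reward(completions, ground_truths, **kwargs):
    """Micro-F1 over (text, type) entity pairs and (head, tail, type) relation triples."""
    rewards = []
    for c, gold in zip(completions, ground_truths):
        pred = _extract_json(c)
        try:
            pred_set = ({(e["text"], e["type"]) for e in pred["entities"]}
                        | {(r["head"], r["tail"], r["type"]) for r in pred["relations"]})
        except (TypeError, KeyError):
            rewards.append(0.0)
            continue
        gold_set = ({(e["text"], e["type"]) for e in gold["entities"]}
                    | {(r["head"], r["tail"], r["type"]) for r in gold["relations"]})
        tp = len(pred_set & gold_set)
        precision = tp / len(pred_set) if pred_set else 0.0
        recall = tp / len(gold_set) if gold_set else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rewards.append(f1)
    return rewards

Functions like these can be passed as a list of reward_funcs to GRPOTrainer; the relative coefficients discussed below can either be folded into the functions themselves or, if the installed TRL version exposes one, set via the trainer's reward-weighting option.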

I assigned different coefficients to the reward functions, prioritizing the F1 reward, because in my early experiments the model got stuck in a local minimum, generating very small JSON outputs.

The reinforcement learning stage allows the model to dynamically adjust its generation strategy, emphasizing correct relation extraction when necessary. Additionally, GRPO enables the model to generate multiple candidate solutions and learn from both positive and negative examples, leading to more robust text-to-graph extraction.
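Wiring this stage up with TRL's GRPOTrainer might look roughly as follows; the dataset file, checkpoint names, and hyperparameters are placeholders, and the stand-in reward should be replaced by the format/JSON/F1 rewards sketched above:

# Sketch of the GRPO stage; everything named here is illustrative.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Hypothetical JSONL file with a "prompt" column (text plus entity/relation type instructions).
dataset = load_dataset("json", data_files="rl_text2graph.jsonl", split="train")

def stub_reward(completions, **kwargs):
    # Placeholder so the example is self-contained; use the real rewards in practice.
    return [0.0 for _ in completions]

training_args = GRPOConfig(
    output_dir="qwen2.5-0.5b-text2graph-grpo",
    num_generations=8,           # size of the candidate group G
    max_completion_length=512,
    per_device_train_batch_size=8,
)

trainer = GRPOTrainer(
    model="qwen2.5-0.5b-text2graph-sft",  # the supervised checkpoint from the previous stage
    reward_funcs=[stub_reward],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()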

Below you can see how the different rewards change over training: the F1 reward grows steadily, while the JSON reward saturates quickly thanks to the supervised pre-training.

[Figure: reward values over training steps — the F1 reward keeps increasing while the JSON reward saturates early.]

The model was able to improve its performance after a short reinforcement learning run, and its performance could be even better with more RL training steps.


We plan to run more experiments with larger models and higher-quality data, so stay tuned. In the meantime, you can try one of the models from our experiments:

https://huggingface.co/Ihor/Text2Graph-R1-Qwen2.5-0.5b

To run the model, please refer to the code snippet below:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Ihor/Text2Graph-R1-Qwen2.5-0.5b"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = """Your text here..."""
prompt = "Analyze this text, identify the entities, and extract meaningful relationships as per given instructions:{}"
messages = [
    {"role": "system", "content": (
                "You are an assistant trained to process any text and extract named entities and relations from it. "
            "Your task is to analyze user-provided text, identify all unique and contextually relevant entities, and infer meaningful relationships between them"
            "Output the annotated data in JSON format, structured as follows:\n\n"
            """{"entities": [{"type": entity_type_0", "text": "entity_0", "id": 0}, "type": entity_type_1", "text": "entity_1", "id": 0}], "relations": [{"head": "entity_0", "tail": "entity_1", "type": "re_type_0"}]}"""
    )},
    {"role": "user", "content": prompt.format(text)}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

The code can be found in this repo; huge thanks to the Hugging Face Open-R1 and TRL projects.

The dataset used in this research project can be found here.

Feel free to share your thoughts and ask questions!
