---
language:
- en
arxiv: 2412.15838
license: cc-by-nc-4.0
tags:
- any-to-any
---

# AnyRewardModel

<span style="color: red;">The All-Modality Generation benchmark evaluates a model's ability to follow instructions, automatically select appropriate modalities, and create synergistic outputs across different modalities (text, visual, audio) while avoiding redundancy.</span>

[🏠 Homepage](https://github.com/PKU-Alignment/align-anything) | [👍 Our Official Code Repo](https://github.com/PKU-Alignment/align-anything)

[🤗 All-Modality Understanding Benchmark](https://huggingface.co/datasets/PKU-Alignment/EvalAnything-AMU)

[🤗 All-Modality Generation Benchmark (Instruction Following Part)](https://huggingface.co/datasets/PKU-Alignment/EvalAnything-InstructionFollowing)

[🤗 All-Modality Generation Benchmark (Modality Selection and Synergy Part)](https://huggingface.co/datasets/PKU-Alignment/EvalAnything-Selection_Synergy)

[🤗 All-Modality Generation Reward Model](https://huggingface.co/PKU-Alignment/AnyRewardModel)



## Data Example

<div align="center">
  <img src="example-amg.png" width="100%"/>
</div>

## Usage
```python
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("PKU-Alignment/AnyRewardModel", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("PKU-Alignment/AnyRewardModel", trust_remote_code=True)
```
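
Depending on your hardware, you may also want to move the model to a GPU and put it in evaluation mode before scoring. A minimal sketch (assumption: if you do this, the tensors returned by the `process_*` helpers below must be moved to the same device as the model):
```python
import torch

# Optional: score on GPU when available; eval() disables dropout during scoring.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
```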

For Image-Audio Modality Synergy scoring:
```python
import math

user_prompt: str = 'USER: {input}'
assistant_prompt: str = '\nASSISTANT:\n{modality}{text_response}'

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def process_ia(prompt, image_path, audio_path):
    # Encode the image and audio inputs with the multimodal processor.
    image_pixel_values = processor(data_paths=image_path, modality="image").pixel_values
    audio_pixel_values = processor(data_paths=audio_path, modality="audio").pixel_values

    text_input = processor(
        text = user_prompt.format(input = prompt) + \
                assistant_prompt.format(modality = "<image><audio>", text_response = ""),
        modality="text"
    )
    return {
        "input_ids": text_input.input_ids,
        "attention_mask": text_input.attention_mask,
        "pixel_values_1": image_pixel_values.unsqueeze(0),
        "pixel_values_2": audio_pixel_values.unsqueeze(0),
        "modality": [["image", "audio"]]
    }


score = sigmoid(model(**process_ia(prompt, image_path, audio_path)).end_scores.squeeze(dim=-1).item())
```

For Text-Image Modality Synergy scoring:
```python
import math

user_prompt: str = 'USER: {input}'
assistant_prompt: str = '\nASSISTANT:\n{modality}{text_response}'

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def process_ti(prompt, response, image_path):
    # Encode the image input with the multimodal processor.
    image_pixel_values = processor(data_paths=image_path, modality="image").pixel_values
    text_input = processor(
        text = user_prompt.format(input = prompt) + \
                assistant_prompt.format(modality = "<image>", text_response = response),
        modality="text"
    )
    return {
        "input_ids": text_input.input_ids,
        "attention_mask": text_input.attention_mask,
        "pixel_values_1": image_pixel_values.unsqueeze(0),
        "modality": [["image", "text"]]
    }

score = sigmoid(model(**process_ti(prompt, response, image_path)).end_scores.squeeze(dim=-1).item())
```

For Text-Audio Modality Synergy scoring:
```python
import math

user_prompt: str = 'USER: {input}'
assistant_prompt: str = '\nASSISTANT:\n{modality}{text_response}'

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def process_ta(prompt, response, audio_path):
    # Encode the audio input with the multimodal processor.
    audio_pixel_values = processor(data_paths=audio_path, modality="audio").pixel_values
    text_input = processor(
        text = user_prompt.format(input = prompt) + \
                assistant_prompt.format(modality = "<audio>", text_response = response),
        modality="text"
    )
    return {
        "input_ids": text_input.input_ids,
        "attention_mask": text_input.attention_mask,
        "pixel_values_1": audio_pixel_values.unsqueeze(0),
        "modality": [["audio", "text"]]
    }

score = sigmoid(model(**process_ta(prompt, response, audio_path)).end_scores.squeeze(dim=-1).item())
```
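
As a usage sketch (not part of the official evaluation pipeline), the same helpers can rank several candidate responses for one prompt. The prompt, image path, and candidate strings below are hypothetical placeholders; `process_ti` and `sigmoid` are the functions defined above:
```python
import torch

prompt = "Describe the scene in the image."   # hypothetical prompt
image_path = "example.png"                    # hypothetical local image file
candidates = [                                # hypothetical candidate responses
    "A dog runs across a sunny meadow.",
    "An abstract pattern of blue squares.",
]

# Score each candidate; gradients are not needed for reward scoring.
with torch.no_grad():
    scores = [
        sigmoid(model(**process_ti(prompt, response, image_path)).end_scores.squeeze(dim=-1).item())
        for response in candidates
    ]

best = max(range(len(candidates)), key=scores.__getitem__)
print(f"Best response (score {scores[best]:.3f}): {candidates[best]}")
```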

## Note:
1. Before using AnyRewardModel, install the following dependencies listed in [requirements.txt](https://huggingface.co/PKU-Alignment/AnyRewardModel/blob/main/requirements.txt):
```txt
ftfy
timm
regex
einops
fvcore
decord
torchaudio
torchvision
pytorchvideo
```
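
Assuming a standard pip environment, these can be installed in one step with `pip install -r requirements.txt` after downloading the file.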

2. If you encounter the following error:
```
ModuleNotFoundError: No module named 'torchvision.transforms.functional_tensor'
```
This typically happens because `torchvision.transforms.functional_tensor` was removed in recent torchvision releases; downgrading torchvision (or updating the offending import to `torchvision.transforms.functional`) usually resolves it. Please refer to the guide in this [blog post](https://blog.csdn.net/lanxing147/article/details/136625264) for detailed resolution steps.

**Note:** The current code is a sample script for the All-Modality Generation subtask of Eval Anything. In the future, we will integrate Eval Anything's evaluation into the framework for the community's convenience.

## Citation
Please cite our work if you use our benchmark or model in your paper.
```bibtex
@inproceedings{ji2024align,
  title={Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback},
  author={Jiaming Ji and Jiayi Zhou and Hantao Lou and Boyuan Chen and Donghai Hong and Xuyao Wang and Wenqi Chen and Kaile Wang and Rui Pan and Jiahao Li and Mohan Wang and Josef Dai and Tianyi Qiu and Hua Xu and Dong Li and Weipeng Chen and Jun Song and Bo Zheng and Yaodong Yang},
  year={2024},
  url={https://arxiv.org/abs/2412.15838}
}
```