File size: 4,608 Bytes
29f6111
 
 
 
750a1ac
 
29f6111
 
3543bae
29f6111
 
3543bae
 
29f6111
3543bae
29f6111
3543bae
29f6111
7e283c6
29f6111
3543bae
29f6111
3543bae
29f6111
 
 
 
 
 
 
 
 
 
 
 
2a00d64
 
 
 
 
 
 
 
dd0d9c9
2a00d64
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3543bae
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
---
license: llama2
---

**This is not an officially supported Google product.**

## Overview

[DiarizationLM](https://arxiv.org/abs/2401.03506) model finetuned
on the training subset of the Fisher corpus.

* Foundation model: [unsloth/llama-2-13b-bnb-4bit](https://huggingface.co/unsloth/llama-2-13b-bnb-4bit)
* Finetuning scripts: https://github.com/google/speaker-id/tree/master/DiarizationLM/unsloth

## Training config

This model is finetuned on the training subset of the Fisher corpus, using a LoRA adapter of rank 256. The total number of training parameters is 1,001,390,080. With a batch size of 16, this model has been trained for 12000 steps, which is ~4 epochs of the training data.

We use the `mixed` flavor during our training, meaning we combine data from `hyp2ora` and `deg2ref` flavors. After the prompt builder, we have a total of 48,142 prompt-completion pairs in our training set.

The finetuning took more than 3 days on a Google Cloud VM instance that has one NVIDIA A100 GPU with 80GB memory.

The maximal length of the prompt to this model is 6000 characters, including the " --> " suffix. The maximal sequence length is 4096 tokens.

## Metrics

Performance on the Fisher testing set:

| System  | WER (%) | WDER (%) | cpWER (%) |
| ------- | ------- | -------- | --------- |
| USM + turn-to-diarize baseline | 15.48 | 5.32 | 21.19 |
| + This model | - | 3.65 | 18.92 |

## Usage

First, you need to install two packages:

```
pip install transformers diarizationlm
```

On a machine with GPU and CUDA, you can use the model by running the following script:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer
from diarizationlm import utils

HYPOTHESIS = """<speaker:1> Hello, how are you doing <speaker:2> today? I am doing well. What about <speaker:1> you? I'm doing well, too. Thank you."""

print("Loading model...")
tokenizer = LlamaTokenizer.from_pretrained("google/DiarizationLM-13b-Fisher-v1", device_map="cuda")
model = LlamaForCausalLM.from_pretrained("google/DiarizationLM-13b-Fisher-v1", device_map="cuda")

print("Tokenizing input...")
inputs = tokenizer([HYPOTHESIS + " --> "], return_tensors = "pt").to("cuda")

print("Generating completion...")
outputs = model.generate(**inputs,
                         max_new_tokens = inputs.input_ids.shape[1] * 1.2,
                         use_cache = False)

print("Decoding completion...")
completion = tokenizer.batch_decode(outputs[:, inputs.input_ids.shape[1]:],
                                    skip_special_tokens = True)[0]

print("Transferring completion to hypothesis text...")
transferred_completion = utils.transfer_llm_completion(completion, HYPOTHESIS)

print("========================================")
print("Hypothesis:", HYPOTHESIS)
print("========================================")
print("Completion:", completion)
print("========================================")
print("Transferred completion:", transferred_completion)
print("========================================")
```

The output will look like below:

```
Loading model...
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 6/6 [00:17<00:00,  2.84s/it]
Tokenizing input...
Generating completion...
Decoding completion...
Transferring completion to hypothesis text...
========================================
Hypothesis: <speaker:1> Hello, how are you doing <speaker:2> today? I am doing well. What about <speaker:1> you? I'm doing well, too. Thank you.
========================================
Completion: 19:27 <speaker:1> hello, how are you doing today? <speaker:2> i am doing well. What about you? <speaker:1> i'm doing well, too. thank you. <speaker:2> my name
========================================
Transferred completion: <speaker:1> Hello, how are you doing today? <speaker:2> I am doing well. What about you? <speaker:1> I'm doing well, too. Thank you.
```

## Citation

Our paper is cited as:

```
@article{wang2024diarizationlm,
  title={{DiarizationLM: Speaker Diarization Post-Processing with Large Language Models}},
  author={Quan Wang and Yiling Huang and Guanlong Zhao and Evan Clark and Wei Xia and Hank Liao},
  journal={arXiv preprint arXiv:2401.03506},
  year={2024}
}
```