Base model: facebook/opt-2.7b
Fine-tuned for causal language modeling of transcribed spoken dialogue from the TalkBank CABank collection. Training corpora include:
- CABNC - Spoken language segment of the British National Corpus
- CallFriend English (N) - Phone calls
- CallFriend English (S) - Phone calls
- CallHome English - Phone calls
- GCSAusE - Australian conversations
- ISL - Conversations recorded to test ASR methods for meeting
- MICASE - Michigan Corpus of Academic Spoken English
- SCoSE - The Saarbrücken Corpus of Spoken (American) English.
(Corpus descriptions are from TalkBank)
Data input format: The data format models a sequence of spoken dialogue between two or more participants:
- The sequence is prefixed with information about the participants, including name (can be a proper noun, a title/role, or unknown), age (can be a number or unknown), and sex (can be male, female, other, or unknown).
- It then lists all utterances in the conversation in order, each prefixed with its speaker's participant code (S1, S2, S3, etc.).
- Utterances support a limited set of transcription notations in the CHAT & CHAT-CA formats:
  - Pauses: `(.)` for a generic short pause, or `(N.N)` for a timed pause. For example, `(3.4)` is a pause for 3.4 seconds.
  - Non-verbal sounds: `&=laughs`, `&=cough`, `&=breathes`, `&=click`, etc. Anything describing a speaker-produced non-verbal sound can come after a prefix of `&=`.
  - Comments about speaker or setting: `[% baby crying in background]`, `[% smiling]`, `[% phone clicking noise]`, `[% imitating him]`, etc. Anything describing the state of the speaker or environment can be in this block. Also, a comment block can be used to describe speaker-produced sounds, but it is more common to use the `&=` prefix for that.
  - Unknown or unintelligible utterances: `xxx`
  - Breathing: `hhh`
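The notations listed above can be picked out of an utterance with a single regular expression. The following is a minimal sketch, not part of the model's own tooling: the names `NOTATION_RE` and `find_notations` are hypothetical, and it assumes that a non-verbal sound after `&=` is a single word and that comment blocks do not nest.

```python
import re

# One named alternative per notation described above.
# Assumption: a sound after "&=" is one word; comment blocks do not nest.
NOTATION_RE = re.compile(
    r"""
    (?P<timed_pause>\(\d+\.\d+\))    # timed pause such as (3.4)
  | (?P<short_pause>\(\.\))          # generic short pause
  | (?P<sound>&=\w+)                 # speaker-produced non-verbal sound
  | (?P<comment>\[%\s[^\]]*\])       # comment about speaker or setting
  | (?P<unintelligible>\bxxx\b)      # unknown or unintelligible speech
  | (?P<breath>\bhhh\b)              # breathing
    """,
    re.VERBOSE,
)


def find_notations(utterance):
    """Return (kind, text) pairs for each notation found, in order of appearance."""
    return [(m.lastgroup, m.group()) for m in NOTATION_RE.finditer(utterance)]


if __name__ == "__main__":
    for kind, text in find_notations("uh yeah (0.8) I can hear you. &=cough xxx"):
        print(kind, text)
```

A helper like this may be useful for stripping or converting the notations in generated text before displaying it or passing it to a speech synthesizer.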
Example:
<participant> S1 (name: Dave, age: 33, sex: male) <participant> S2 (name: unknown, age: unknown, sex: unknown) <dialog> S1: Hi! (2.3) are you there? S2: hhh hhh [% background noise] uh yeah (0.8) I can hear you. (1.2) &=cough can you hear me? S1: ...
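As a rough illustration of the format, the sketch below assembles an input string from participant records and utterances. The helper names `format_participant` and `build_prompt` are hypothetical, and joining the segments with single spaces is an assumption based on how the example above is displayed.

```python
def format_participant(code, name="unknown", age="unknown", sex="unknown"):
    """Render one participant header; missing attributes default to 'unknown'."""
    return f"<participant> {code} (name: {name}, age: {age}, sex: {sex})"


def build_prompt(participants, utterances, next_speaker=None):
    """Assemble a dialogue string in the format described above.

    participants: list of dicts with a 'code' key and optional name/age/sex keys.
    utterances:   list of (participant_code, text) tuples in spoken order.
    next_speaker: if given, the string ends with that code so the model
                  continues as that participant.
    Assumption: segments are separated by single spaces.
    """
    parts = [format_participant(**p) for p in participants]
    parts.append("<dialog>")
    parts.extend(f"{code}: {text}" for code, text in utterances)
    if next_speaker is not None:
        parts.append(f"{next_speaker}:")
    return " ".join(parts)


if __name__ == "__main__":
    prompt = build_prompt(
        participants=[
            {"code": "S1", "name": "Dave", "age": 33, "sex": "male"},
            {"code": "S2"},
        ],
        utterances=[("S1", "Hi! (2.3) are you there?")],
        next_speaker="S2",
    )
    print(prompt)
```

Ending the string with a participant code and a colon leaves the next utterance open for the model to complete.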
Usage Info:
Per the OPT documentation, the model was trained with the tokenizer setting `use_fast=False`.
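A minimal sketch of loading the model with that tokenizer setting through the Hugging Face `transformers` library might look like the following. The function name `generate_continuation` is hypothetical, `model_id` must be set to this model's Hub repository id (not stated here), and the sampling settings are arbitrary illustrative choices rather than recommended values.

```python
def generate_continuation(model_id, prompt, max_new_tokens=30):
    """Generate a continuation of a dialogue string.

    model_id: Hub repository id or local path of the fine-tuned model
              (an assumption: supply the id of this model yourself).
    prompt:   dialogue string in the input format described above.
    """
    # Imported lazily so the heavy dependency is only needed when called.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # use_fast=False matches the tokenizer setting used in training.
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,  # illustrative sampling settings, not tuned values
        top_p=0.95,
    )
    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling the function downloads the model weights on first use, so it needs network access and enough memory for a 2.7B-parameter model.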
To use this model for real-time inference in a continuous duplex dialogue system, see: https://github.com/AbrahamSanders/realtime-chatbot.