---
title: ATC Transcription Assistant
emoji: ✈️
colorFrom: purple
colorTo: red
sdk: docker
pinned: false
---
# ATC Transcription Assistant

## Overview
Welcome to the ATC Transcription Assistant, a tool for transcribing Air Traffic Control (ATC) audio. The app uses OpenAI's Whisper medium.en model, fine-tuned specifically for ATC communications. Fine-tuning significantly improves transcription accuracy on this kind of audio, making the app useful for researchers, enthusiasts, and professionals who analyze ATC communications.

This project is part of a broader research initiative aimed at improving Automatic Speech Recognition (ASR) accuracy in high-stakes aviation environments.
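For orientation, loading a fine-tuned Whisper checkpoint with the Hugging Face `transformers` pipeline looks roughly like the sketch below. The model ID is a hypothetical placeholder, not the actual checkpoint behind this Space.

```python
# Minimal sketch: load a fine-tuned Whisper checkpoint and transcribe a clip.
# "your-username/whisper-medium.en-atc" is a placeholder ID, not the real one.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="your-username/whisper-medium.en-atc",  # hypothetical model ID
    chunk_length_s=30,  # Whisper processes audio in 30-second windows
)

print(asr("tower_clip.wav")["text"])
```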
## Features
- Transcription Model: a fine-tuned version of the Whisper medium.en model.
- Audio Formats: supports MP3 and WAV files containing ATC audio (see the preprocessing sketch after this list).
- Transcription Output: converts uploaded audio into text and displays it in an easily readable format.
- Enhanced Accuracy: the fine-tuned model achieves a Word Error Rate (WER) of 15.08%, a significant improvement over the 94.59% WER of the non-fine-tuned model.
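Whisper expects 16 kHz mono input, so uploaded MP3 and WAV files are typically decoded and resampled before inference. A minimal sketch, assuming `librosa` for decoding and the same hypothetical model ID as above:

```python
# Sketch: normalize an uploaded MP3/WAV clip to 16 kHz mono for Whisper.
import librosa
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="your-username/whisper-medium.en-atc",  # hypothetical model ID
)

def transcribe_file(path: str) -> str:
    # librosa decodes both MP3 and WAV; sr=16000 resamples, mono=True downmixes.
    audio, sr = librosa.load(path, sr=16000, mono=True)
    return asr({"raw": audio, "sampling_rate": sr})["text"]

print(transcribe_file("approach_clip.mp3"))
```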
## Performance
- Fine-tuned Whisper medium.en WER: 15.08%
- Non-fine-tuned Whisper medium.en WER: 94.59%
- Relative Improvement: 84.06%
While the fine-tuned model provides substantial improvements, please note that transcription accuracy is not guaranteed.
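The relative improvement figure follows directly from the two WER values; a quick check:

```python
# Quick check of the relative-improvement figure reported above.
baseline_wer = 94.59   # non-fine-tuned Whisper medium.en
finetuned_wer = 15.08  # fine-tuned Whisper medium.en

improvement = (baseline_wer - finetuned_wer) / baseline_wer
print(f"{improvement:.2%}")  # -> 84.06%
```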
For more details on the fine-tuning process and model performance, see the blog post, or check out the project repository.
## How It Works
1. Upload ATC Audio: upload an audio file containing ATC communications in MP3 or WAV format.
2. View Transcription: the app transcribes the audio and displays the text on the screen.
3. Transcribe More Audio: to transcribe another file, click New Chat in the top-right corner of the app.
## Fine-Tuning Process
The Whisper model was fine-tuned on a custom ATC dataset created from publicly available resources, such as:
- The ATCO2 test subset (871 audio-transcription pairs).
- The UWB-ATCC corpus (11.3k rows in the training set and 2.82k rows in the test set).
After data preprocessing, dynamic data augmentation was applied to simulate challenging conditions during fine-tuning. The fine-tuned model was trained for 10 epochs on two A100 GPUs, achieving an average WER of 15.08%.
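As an illustration of what dynamic augmentation can look like, the sketch below mixes in noise at a random signal-to-noise ratio each time a clip is drawn, so every epoch sees a differently degraded copy of the same audio. The SNR range and noise type are assumptions for illustration, not details taken from the project.

```python
# Illustrative dynamic augmentation: mix in Gaussian noise at a random SNR
# each time a clip is drawn, simulating degraded radio conditions.
# The 5-20 dB SNR range is an assumption, not a project detail.
import numpy as np

def augment_with_noise(audio: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    snr_db = rng.uniform(5.0, 20.0)                      # random target SNR
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise

rng = np.random.default_rng(seed=0)
clip = np.sin(2 * np.pi * 440.0 * np.arange(16000) / 16000.0)  # stand-in 1 s tone
noisy = augment_with_noise(clip, rng)
```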
## Limitations
- Word Error Rate (WER): WER counts every substitution, insertion, and deletion equally, so it does not account for meaning or near-matches; a harmless variant scores the same as a critical error (see the example after this list).
- Transcription Accuracy: in real-world use, minor errors may occur, but they often don't significantly change the meaning of the communication.
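To make WER's rigidity concrete, here is a made-up pair where the hypothesis is semantically fine but still scores two word errors (computed with the `jiwer` package):

```python
# Made-up example of WER's rigidity: "3000" vs "three thousand" reads the
# same to a controller, but counts as a substitution plus a deletion.
from jiwer import wer

reference = "climb and maintain three thousand"
hypothesis = "climb and maintain 3000"

print(f"WER: {wer(reference, hypothesis):.2%}")  # 2 errors / 5 words -> 40.00%
```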
## Get in Touch
If you have any questions or suggestions, feel free to contact me at [email protected].
## License
This project is licensed under the MIT License.