---
title: ATC Transcription Assistant
emoji: ✈️
colorFrom: purple
colorTo: red
sdk: docker
pinned: false
---
# ATC Transcription Assistant

## Overview
Welcome to the ATC Transcription Assistant, a tool for transcribing Air Traffic Control (ATC) audio. The app uses OpenAI's Whisper medium.en model, fine-tuned specifically for ATC communications. Fine-tuning significantly improves transcription accuracy on this kind of audio, making the app useful for researchers, enthusiasts, and professionals who analyze ATC communications.

This project is part of a broader research initiative aimed at improving Automatic Speech Recognition (ASR) accuracy in high-stakes aviation environments.
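For orientation, loading a fine-tuned Whisper checkpoint with the Hugging Face `transformers` pipeline looks roughly like the sketch below. The model ID is a hypothetical placeholder, not the actual checkpoint behind this Space.

```python
# Minimal sketch: load a fine-tuned Whisper checkpoint and transcribe a clip.
# "your-username/whisper-medium.en-atc" is a placeholder ID, not the real one.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="your-username/whisper-medium.en-atc",  # hypothetical model ID
    chunk_length_s=30,  # Whisper processes audio in 30-second windows
)

print(asr("tower_clip.wav")["text"])
```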
## Features
- Transcription Model: a fine-tuned version of the Whisper medium.en model.
- Audio Formats: supports MP3 and WAV files containing ATC audio (see the preprocessing sketch after this list).
- Transcription Output: converts uploaded audio into text and displays it in an easily readable format.
- Enhanced Accuracy: the fine-tuned model achieves a Word Error Rate (WER) of 15.08%, a significant improvement over the 94.59% WER of the non-fine-tuned model.
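Whisper expects 16 kHz mono input, so uploaded MP3 and WAV files are typically decoded and resampled before inference. A minimal sketch, assuming `librosa` for decoding and the same hypothetical model ID as above:

```python
# Sketch: normalize an uploaded MP3/WAV clip to 16 kHz mono for Whisper.
import librosa
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="your-username/whisper-medium.en-atc",  # hypothetical model ID
)

def transcribe_file(path: str) -> str:
    # librosa decodes both MP3 and WAV; sr=16000 resamples, mono=True downmixes.
    audio, sr = librosa.load(path, sr=16000, mono=True)
    return asr({"raw": audio, "sampling_rate": sr})["text"]

print(transcribe_file("approach_clip.mp3"))
```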
## Performance
- Fine-tuned Whisper medium.en WER: 15.08%
- Non-fine-tuned Whisper medium.en WER: 94.59%
- Relative Improvement: 84.06%
While the fine-tuned model provides substantial improvements, please note that transcription accuracy is not guaranteed.
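The relative improvement figure follows directly from the two WER values; a quick check:

```python
# Quick check of the relative-improvement figure reported above.
baseline_wer = 94.59   # non-fine-tuned Whisper medium.en
finetuned_wer = 15.08  # fine-tuned Whisper medium.en

improvement = (baseline_wer - finetuned_wer) / baseline_wer
print(f"{improvement:.2%}")  # -> 84.06%
```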
For more details on the fine-tuning process and model performance, see the blog post, or check out the project repository.
## How It Works
1. Upload ATC Audio: upload an audio file containing ATC communications in MP3 or WAV format.
2. View Transcription: the app transcribes the audio and displays the text on the screen.
3. Transcribe More Audio: to transcribe another file, click New Chat in the top-right corner of the app.
## Fine-Tuning Process
The Whisper model was fine-tuned on a custom ATC dataset created from publicly available resources, such as:
- The ATCO2 test subset (871 audio-transcription pairs).
- The UWB-ATCC corpus (11.3k rows in the training set and 2.82k rows in the test set).
After data preprocessing, dynamic data augmentation was applied to simulate challenging conditions during fine-tuning. The fine-tuned model was trained for 10 epochs on two A100 GPUs, achieving an average WER of 15.08%.
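As an illustration of what dynamic augmentation can look like, the sketch below mixes in noise at a random signal-to-noise ratio each time a clip is drawn, so every epoch sees a differently degraded copy of the same audio. The SNR range and noise type are assumptions for illustration, not details taken from the project.

```python
# Illustrative dynamic augmentation: mix in Gaussian noise at a random SNR
# each time a clip is drawn, simulating degraded radio conditions.
# The 5-20 dB SNR range is an assumption, not a project detail.
import numpy as np

def augment_with_noise(audio: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    snr_db = rng.uniform(5.0, 20.0)                      # random target SNR
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise

rng = np.random.default_rng(seed=0)
clip = np.sin(2 * np.pi * 440.0 * np.arange(16000) / 16000.0)  # stand-in 1 s tone
noisy = augment_with_noise(clip, rng)
```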
## Limitations
- Word Error Rate (WER): WER counts every substitution, insertion, and deletion equally, so it does not account for meaning or near-matches; a harmless variant scores the same as a critical error (see the example after this list).
- Transcription Accuracy: in real-world use, minor errors may occur, but they often don't significantly change the meaning of the communication.
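To make WER's rigidity concrete, here is a made-up pair where the hypothesis is semantically fine but still scores two word errors (computed with the `jiwer` package):

```python
# Made-up example of WER's rigidity: "3000" vs "three thousand" reads the
# same to a controller, but counts as a substitution plus a deletion.
from jiwer import wer

reference = "climb and maintain three thousand"
hypothesis = "climb and maintain 3000"

print(f"WER: {wer(reference, hypothesis):.2%}")  # 2 errors / 5 words -> 40.00%
```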
## Get in Touch
If you have any questions or suggestions, feel free to contact me at [email protected].
## License
This project is licensed under the MIT License.