{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "oUL6DV1zCIlB" }, "outputs": [], "source": [ "%matplotlib inline\n", "!nvidia-smi" ] }, { "cell_type": "markdown", "metadata": { "id": "WmJySTGXCIlD" }, "source": [ "\n", "# TorchMultimodal Tutorial: Finetuning FLAVA\n" ] }, { "cell_type": "markdown", "metadata": { "id": "ZJCb2uRyCIlE" }, "source": [ "Multimodal AI has recently become very popular owing to its ubiquitous\n", "nature, from use cases like image captioning and visual search to more\n", "recent applications like image generation from text. **TorchMultimodal\n", "is a library powered by Pytorch consisting of building blocks and end to\n", "end examples, aiming to enable and accelerate research in\n", "multimodality**.\n", "\n", "In this tutorial, we will demonstrate how to use a **pretrained SoTA\n", "model called** [FLAVA](https://arxiv.org/pdf/2112.04482.pdf)_ **from\n", "TorchMultimodal library to finetune on a multimodal task i.e. visual\n", "question answering** (VQA). The model consists of two unimodal transformer\n", "based encoders for text and image and a multimodal encoder to combine\n", "the two embeddings. It is pretrained using contrastive, image text matching and \n", "text, image and multimodal masking losses.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "0TjU3iQgCIlE" }, "source": [ "## Installation\n", "We will use TextVQA dataset and bert tokenizer from HuggingFace for this\n", "tutorial. So you need to install datasets and transformers in addition to TorchMultimodal.\n", "\n", "
{ "cell_type": "markdown", "metadata": { "id": "0TjU3iQgCIlE" }, "source": [ "## Installation\n", "We will use the TextVQA dataset and the BERT tokenizer from Hugging Face for this\n", "tutorial, so you need to install the datasets and transformers packages in addition to TorchMultimodal.\n", "\n", "When running this tutorial in Google Colab, install the required packages by\n", "creating a new cell and running the following commands:\n", "\n", "```\n", "!pip install torchmultimodal-nightly\n", "!pip install datasets\n", "!pip install transformers