---
language: en
license: mit
library_name: transformers
tags:
- video-classification
- videomae
- vision
---

# Model Card for videomae-base-finetuned-ucf101

See the [WandB report](https://wandb.ai/nateraw/videomae-finetune-ucf101/reports/Fine-Tuning-VideoMAE-Base-on-UCF101--VmlldzoyOTUwMjk4) for metrics.

# Table of Contents

1. [Model Details](#model-details)
2. [Uses](#uses)
3. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
4. [Training Details](#training-details)
5. [Evaluation](#evaluation)
6. [Model Examination](#model-examination-optional)
7. [Environmental Impact](#environmental-impact)
8. [Technical Specifications](#technical-specifications-optional)
9. [Citation](#citation-optional)
10. [Glossary](#glossary-optional)
11. [More Information](#more-information-optional)
12. [Model Card Authors](#model-card-authors-optional)
13. [Model Card Contact](#model-card-contact)
14. [How To Get Started With the Model](#how-to-get-started-with-the-model)

# Model Details

## Model Description

VideoMAE Base model fine-tuned on UCF101.

- **Developed by:** [@nateraw](https://huggingface.co/nateraw)
- **Shared by [optional]:** [More Information Needed]
- **Model type:** fine-tuned
- **Language(s) (NLP):** en
- **License:** mit
- **Related Models [optional]:** [More Information Needed]
- **Parent Model [optional]:** [MCG-NJU/videomae-base](https://huggingface.co/MCG-NJU/videomae-base)
- **Resources for more information:** [More Information Needed]

# Uses

## Direct Use

This model can be used for video action recognition.

## Downstream Use [optional]

[More Information Needed]

## Out-of-Scope Use

[More Information Needed]

# Bias, Risks, and Limitations

[More Information Needed]

## Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
# Training Details

## Training Data

[More Information Needed]

## Training Procedure [optional]

### Preprocessing

We sampled 64-frame clips from the videos, then uniformly subsampled each clip to 16 frames to form the model input. During training, we used PyTorchVideo's [`MixVideo`](https://github.com/facebookresearch/pytorchvideo/blob/main/pytorchvideo/transforms/mix.py) to apply mixup/cutmix.

### Speeds, Sizes, Times

[More Information Needed]

# Evaluation

## Testing Data, Factors & Metrics

### Testing Data

[More Information Needed]

### Factors

[More Information Needed]

### Metrics

[More Information Needed]

## Results

We trained and evaluated on only one fold of the UCF101 annotations. Unlike the VideoMAE paper, we did not run inference over multiple crops/segments of each validation video, so the results are likely slightly lower than multi-crop evaluation would give.

- Eval Accuracy: 0.758209764957428
- Eval Accuracy Top 5: 0.8983050584793091

# Model Examination [optional]

[More Information Needed]

# Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

# Technical Specifications [optional]

## Model Architecture and Objective

[More Information Needed]

## Compute Infrastructure

[More Information Needed]

### Hardware

[More Information Needed]

### Software

[More Information Needed]

# Citation [optional]

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

# Glossary [optional]

[More Information Needed]

# More Information [optional]

[More Information Needed]

# Model Card Authors [optional]

[@nateraw](https://huggingface.co/nateraw)

# Model Card Contact

[@nateraw](https://huggingface.co/nateraw)

# How to Get Started with the Model

Use the code below to get started with the model.
```python
from decord import VideoReader, cpu
import torch
import numpy as np

from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
from huggingface_hub import hf_hub_download

np.random.seed(0)


def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    # Pick a random window of clip_len * frame_sample_rate frames,
    # then take clip_len evenly spaced indices from it.
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    return indices


# video clip consists of 300 frames (10 seconds at 30 FPS)
file_path = hf_hub_download(
    repo_id="nateraw/dino-clips", filename="archery.mp4", repo_type="space"
)
videoreader = VideoReader(file_path, num_threads=1, ctx=cpu(0))

# sample 16 frames
videoreader.seek(0)
indices = sample_frame_indices(clip_len=16, frame_sample_rate=4, seg_len=len(videoreader))
video = videoreader.get_batch(indices).asnumpy()

image_processor = VideoMAEImageProcessor.from_pretrained("nateraw/videomae-base-finetuned-ucf101")
model = VideoMAEForVideoClassification.from_pretrained("nateraw/videomae-base-finetuned-ucf101")

inputs = image_processor(list(video), return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# model predicts one of the 101 UCF101 classes
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```
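As a rough illustration of the preprocessing described under Training Details, the sketch below picks a random 64-frame window and then takes 16 evenly spaced frame indices from it. It is not the training code: the helper name `uniform_subsample_indices`, the rounding rule, and the random choice of window are assumptions made for illustration.

```python
import numpy as np


def uniform_subsample_indices(num_video_frames, clip_len=64, num_samples=16, rng=None):
    """Hypothetical helper: choose a clip_len-frame window, then num_samples evenly spaced frames."""
    if num_video_frames < clip_len:
        raise ValueError("video is shorter than the requested clip length")
    if rng is None:
        rng = np.random.default_rng()
    # Random start chosen so the whole clip fits inside the video.
    start = int(rng.integers(0, num_video_frames - clip_len + 1))
    # Evenly spaced offsets across the clip, first and last frame included
    # (rounding to the nearest frame is an assumption).
    offsets = np.linspace(0, clip_len - 1, num=num_samples).round().astype(np.int64)
    return start + offsets


# Example: a 300-frame video, seeded for reproducibility.
indices = uniform_subsample_indices(300, rng=np.random.default_rng(0))
print(indices)
```

The resulting indices could be passed to `videoreader.get_batch(indices)` in place of the output of `sample_frame_indices` above.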