nateraw's picture
Update README.md
4e9403d
metadata
language: en
license: mit
library_name: transformers
tags:
  - video-classification
  - videomae
  - vision

Model Card for videomae-base-finetuned-ucf101

A WandB report here for metrics.

Table of Contents

  1. Model Details
  2. Uses
  3. Bias, Risks, and Limitations
  4. Training Details
  5. Evaluation
  6. Model Examination
  7. Environmental Impact
  8. Technical Specifications
  9. Citation
  10. Glossary
  11. More Information
  12. Model Card Authors
  13. Model Card Contact
  14. How To Get Started With the Model

Model Details

Model Description

VideoMAE Base model fine tuned on UCF101

  • Developed by: @nateraw
  • Shared by [optional]: [More Information Needed]
  • Model type: fine-tuned
  • Language(s) (NLP): en
  • License: mit
  • Related Models [optional]: [More Information Needed]
  • Resources for more information: [More Information Needed]

Uses

Direct Use

This model can be used for Video Action Recognition

Downstream Use [optional]

[More Information Needed]

Out-of-Scope Use

[More Information Needed]

Bias, Risks, and Limitations

[More Information Needed]

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recomendations.

Training Details

Training Data

[More Information Needed]

Training Procedure [optional]

Preprocessing

We sampled clips from the videos of 64 frames, then took a uniform sample of those frames to get 16 frame inputs for the model. During training, we used PyTorchVideo's MixVideo to apply mixup/cutmix.

Speeds, Sizes, Times

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

[More Information Needed]

Results

We only trained/evaluated one fold from the UCF101 annotations. Unlike in the VideoMAE paper, we did not perform inference over multiple crops/segments of validation videos, so the results are likely slightly lower than what you would get if you did that too.

  • Eval Accuracy: 0.758209764957428
  • Eval Accuracy Top 5: 0.8983050584793091

Model Examination [optional]

[More Information Needed]

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: [More Information Needed]
  • Hours used: [More Information Needed]
  • Cloud Provider: [More Information Needed]
  • Compute Region: [More Information Needed]
  • Carbon Emitted: [More Information Needed]

Technical Specifications [optional]

Model Architecture and Objective

[More Information Needed]

Compute Infrastructure

[More Information Needed]

Hardware

[More Information Needed]

Software

[More Information Needed]

Citation [optional]

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Glossary [optional]

[More Information Needed]

More Information [optional]

[More Information Needed]

Model Card Authors [optional]

@nateraw

Model Card Contact

@nateraw

How to Get Started with the Model

Use the code below to get started with the model.

Click to expand
from decord import VideoReader, cpu
import torch
import numpy as np

from transformers import VideoMAEFeatureExtractor, VideoMAEForVideoClassification
from huggingface_hub import hf_hub_download

np.random.seed(0)


def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    return indices


# video clip consists of 300 frames (10 seconds at 30 FPS)
file_path = hf_hub_download(
    repo_id="nateraw/dino-clips", filename="archery.mp4", repo_type="space"
)
videoreader = VideoReader(file_path, num_threads=1, ctx=cpu(0))

# sample 16 frames
videoreader.seek(0)
indices = sample_frame_indices(clip_len=16, frame_sample_rate=4, seg_len=len(videoreader))
video = videoreader.get_batch(indices).asnumpy()

feature_extractor = VideoMAEFeatureExtractor.from_pretrained("nateraw/videomae-base-finetuned-ucf101")
model = VideoMAEForVideoClassification.from_pretrained("nateraw/videomae-base-finetuned-ucf101")

inputs = feature_extractor(list(video), return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# model predicts one of the 101 UCF101 classes
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])