Files changed (1)
  1. README.md +71 -25
README.md CHANGED
@@ -9,17 +9,56 @@ pipeline_tag: text-to-video
9
  ---
10
 
11
 
12
- # Model Card
13
- ## Details
14
- This model underwent training using CLIP4Clip, a video retrieval method based on the CLIP framework, as described in the paper [CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip
15
- Retrieval](https://arxiv.org/pdf/2104.08860.pdf) by Lou et el, and implemented in the accompanying [code](https://github.com/ArrowLuo/CLIP4Clip).
16
 
17
- The training process involved 150,000 videos obtained from the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/), a comprehensive collection of short videos with corresponding textual descriptions sourced from the web.
 
18
 
19
- In order to integrate the trained clip model into the implementation of [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32), we have made modifications to the weights.
20
 
 
21
 
22
- ### Use with Transformers
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
23
  ### Extracting Text Embeddings:
24
 
25
  ```python
@@ -36,31 +75,38 @@ tokenizer = CLIPTokenizer.from_pretrained("Diangle/clip4clip-webvid")
36
  inputs = tokenizer(text=search_sentence , return_tensors="pt")
37
  outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
38
 
39
- # Normalizing the embeddings:
40
  final_output = outputs[0] / outputs[0].norm(dim=-1, keepdim=True)
41
  final_output = final_output.cpu().detach().numpy()
42
  print("final_output: ", final_output)
43
  ```
44
 
45
- ### Extracting Video Embeddings:
46
-
47
- An additional [notebook](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval_VideoEmbedding.ipynb) is available that provides instructions on how to perform video embedding.
48
-
49
-
50
- ## Model Intended Use
51
-
52
- This model is intended to use for video retrieval, look for example this [**SPACE**](https://huggingface.co/spaces/Diangle/Clip4Clip-webvid).
53
 
 
54
 
55
- ## Performance
 
 
56
 
57
- We have evaluated the performance of differenet models on the last 10k video clips from Webvid database.
58
 
59
- | Model | R1 | R5 | R10 | MedianR | MeanR
60
- |------------------------|-------|-------|-------|-----|---------|
61
- | Zero-shot clip weights | 37.16 | 62.10 | 71.16 | 3.0 | 42.2128
62
- | CLIP4Clip weights trained on msr-vtt | 38.38 | 62.89 | 72.01 | 3.0 |39.3023
63
- | **CLIP4Clip trained on 150k Webvid** | 50.74 | 77.30 | 85.05 | 1.0 | 14.9535
64
- | Binarized CLIP4Clip trained on 150k Webvid with rerank100 | 50.56 | 76.39 | 83.51 | 1.0 | 43.2964
 
 
 
65
 
66
- For more information about the evaluation you can look at this [notebook](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval-Evaluation.ipynb).
 
 
 
 
 
 
 
 
 
9
  ---
10
 
11
 
12
+ # Model Card for CLIP4Clip/WebVid-150k
13
+ ## Model Details
14
+ A CLIP4Clip video-text retrieval model trained on a subset of the WebVid dataset.
15
+ The model and training method are described in the paper ["CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"](https://arxiv.org/pdf/2104.08860.pdf) by Luo et al. and implemented in the accompanying [GitHub repository](https://github.com/ArrowLuo/CLIP4Clip).
16
 
17
+ The training process utilized the [WebVid Dataset](https://m-bain.github.io/webvid-dataset/), a comprehensive collection of short videos with corresponding textual descriptions sourced from the web.
18
+ For training, a subset consisting of the first 150,000 video-text pairs of the dataset was used.
19
 
20
+ This HF model is based on the [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) architecture, with weights trained by Daphna Idelson at [Searchium](https://www.searchium.ai).
21
 
22
+ ## Motivation
23
 
24
+ As per the original authors, the main motivation behind this work is to leverage the power of the CLIP image-language pre-training model and apply it to learning
25
+ visual-temporal concepts from videos, thereby improving video-based searches.
26
+
27
+ Training on WebVid, a large and diverse dataset, improved the model's retrieval performance beyond the results reported in the paper.
28
+
29
+
30
+ ## Model Intended Use
31
+
32
+ This model is intended for use in large-scale video-text retrieval applications.
33
+
34
+ For a demonstration, see the accompanying [**Video Search Space**](https://huggingface.co/spaces/Diangle/Clip4Clip-webvid), which searches a collection of approximately 1.5 million videos.
35
+ This interactive demo showcases the model's capability to effectively retrieve videos based on text queries, highlighting its potential for handling substantial video datasets.
36
+
37
+ ## Evaluations
38
+
39
+ To evaluate the model's performance, we used the last 10,000 video clips and their accompanying text from the WebVid dataset.
40
+ We report recall at ranks 1, 5, and 10 (R1, R5, R10), median rank (MedianR), and mean rank (MeanR) for:
41
+ 1. Zero-shot pretrained clip-vit-base-patch32 model
42
+ 2. CLIP4Clip based weights trained on the dataset [MSR-VTT](https://paperswithcode.com/dataset/msr-vtt), consisting of 10,000 video-text pairs
43
+ 3. CLIP4Clip based weights trained on a 150K subset of the dataset Webvid-2M
44
+ 4. CLIP4Clip based weights trained on a 150K subset of the dataset Webvid-2M - binarized and further finetuned on 100 top searches -
45
+ for search acceleration and efficiency <sup>[<a href="#footnote1">1</a>]</sup>.
46
+
47
+ | Model | R1 &uarr; | R5 &uarr; | R10 &uarr; | MedianR &darr; | MeanR &darr;
48
+ |------------------------|-------|-------|-------|-----|---------|
49
+ | Zero-shot clip weights | 37.16 | 62.10 | 71.16 | 3.0 | 42.2128
50
+ | CLIP4Clip weights trained on msr-vtt | 38.38 | 62.89 | 72.01 | 3.0 |39.3023
51
+ | **CLIP4Clip trained on 150k Webvid** | 50.74 | 77.30 | 85.05 | 1.0 | 14.9535
52
+ | Binarized CLIP4Clip trained on 150k Webvid with rerank100 | 50.56 | 76.39 | 83.51 | 1.0 | 43.2964
53
+
54
+ For a detailed description of the evaluation, refer to the notebook
55
+ [GSI_VideoRetrieval-Evaluation](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval-Evaluation.ipynb).
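
The metrics in the table above can be computed from a text-to-video similarity matrix. The following is a minimal sketch, not the notebook's code: it assumes that text `i` is paired with video `i`, and that ties are broken in favor of the correct video. The function name `retrieval_metrics` is hypothetical.

```python
import numpy as np


def retrieval_metrics(sim):
    """Compute R1, R5, R10, MedianR and MeanR from a similarity matrix.

    sim[i, j] is the similarity between text i and video j.
    Assumption: the ground-truth match for text i is video i.
    """
    sim = np.asarray(sim, dtype=np.float64)
    # Similarity of each text to its own (correct) video, as a column vector.
    correct = np.diag(sim)[:, None]
    # Rank of the correct video: 1 + number of videos scoring strictly higher.
    ranks = 1 + (sim > correct).sum(axis=1)
    return {
        "R1": 100.0 * float(np.mean(ranks <= 1)),
        "R5": 100.0 * float(np.mean(ranks <= 5)),
        "R10": 100.0 * float(np.mean(ranks <= 10)),
        "MedianR": float(np.median(ranks)),
        "MeanR": float(np.mean(ranks)),
    }


if __name__ == "__main__":
    # Toy example: 3 texts against 3 videos.
    toy = [[0.9, 0.1, 0.0],
           [0.8, 0.2, 0.1],
           [0.1, 0.2, 0.7]]
    print(retrieval_metrics(toy))
```

Higher is better for R1, R5, and R10 (percentage of queries whose correct video appears in the top k); lower is better for MedianR and MeanR.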
56
+
57
+ <div id="footnote1">
58
+ <p>[1] For overall search acceleration capabilities to boost your search application, please refer to searchium.ai</p>
59
+ </div>
60
+
61
+ ### How to use
62
  ### Extracting Text Embeddings:
63
 
64
  ```python
 
75
  inputs = tokenizer(text=search_sentence , return_tensors="pt")
76
  outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
77
 
78
+ # Normalize embeddings for retrieval:
79
  final_output = outputs[0] / outputs[0].norm(dim=-1, keepdim=True)
80
  final_output = final_output.cpu().detach().numpy()
81
  print("final_output: ", final_output)
82
  ```
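
Because the embeddings are L2-normalized, a dot product between a text embedding and a video embedding equals their cosine similarity, so retrieval reduces to a matrix-vector product followed by a sort. The sketch below illustrates this with NumPy only. The function `top_k_videos` and the array `video_embeddings` are hypothetical names, and the video embeddings are assumed to have been precomputed elsewhere.

```python
import numpy as np


def top_k_videos(text_embedding, video_embeddings, k=5):
    """Return indices and scores of the k videos most similar to the text.

    text_embedding: shape (D,) or (1, D), e.g. the normalized text output.
    video_embeddings: shape (N, D), one row per video (assumed precomputed).
    """
    t = np.asarray(text_embedding, dtype=np.float64).reshape(-1)
    v = np.asarray(video_embeddings, dtype=np.float64)
    # Normalize defensively so the dot product is a cosine similarity.
    t = t / np.linalg.norm(t)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    scores = v @ t
    # Sort by descending similarity and keep the first k entries.
    order = np.argsort(-scores)[:k]
    return order, scores[order]
```

For a collection of millions of videos, an exhaustive scan like this may be too slow, and an approximate nearest-neighbor index is a common alternative.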
83
 
84
+ ### Extracting Video Embeddings:
 
 
 
 
 
 
 
85
 
86
+ Extracting video embeddings is somewhat more involved, so example usage with utility functions is provided in the additional notebook [GSI_VideoRetrieval_VideoEmbedding.ipynb](https://huggingface.co/Diangle/clip4clip-webvid/blob/main/Notebooks/GSI_VideoRetrieval_VideoEmbedding.ipynb).
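
As a rough illustration of the final step only, the CLIP4Clip paper describes a parameter-free variant that mean-pools per-frame embeddings into a single video embedding. The sketch below assumes the per-frame embeddings have already been extracted with the vision model. Whether the notebook follows exactly these steps is an assumption, and `pool_frame_embeddings` is a hypothetical name.

```python
import numpy as np


def pool_frame_embeddings(frame_embeddings):
    """Mean-pool per-frame embeddings into one normalized video embedding.

    frame_embeddings: shape (num_frames, D), one row per sampled frame
    (assumed to come from the model's vision encoder).
    """
    frames = np.asarray(frame_embeddings, dtype=np.float64)
    # Normalize each frame embedding so every frame contributes equally.
    frames = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    # Average over the frame axis, then renormalize for cosine similarity.
    video = frames.mean(axis=0)
    return video / np.linalg.norm(video)
```

The resulting vector can be compared against normalized text embeddings with a dot product.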
87
 
88
+ ## Acknowledgements
89
+ Acknowledging Diana Mazenko of [Searchium](https://www.searchium.ai) for adapting and loading the model to Hugging Face, and for creating a Hugging Face [**SPACE**](https://huggingface.co/spaces/Diangle/Clip4Clip-webvid) for a large-scale video-search demo.
90
+ Acknowledgements also to Luo et al. for their comprehensive work on CLIP4Clip and openly available code.
91
 
92
+ ## Citations
93
 
94
+ CLIP4Clip paper
95
+ ```
96
+ @Article{Luo2021CLIP4Clip,
97
+ author = {Huaishao Luo and Lei Ji and Ming Zhong and Yang Chen and Wen Lei and Nan Duan and Tianrui Li},
98
+ title = {{CLIP4Clip}: An Empirical Study of CLIP for End to End Video Clip Retrieval},
99
+ journal = {arXiv preprint arXiv:2104.08860},
100
+ year = {2021},
101
+ }
102
+ ```
103
 
104
+ OpenAI CLIP paper
105
+ ```
106
+ @inproceedings{Radford2021LearningTV,
107
+ title={Learning Transferable Visual Models From Natural Language Supervision},
108
+ author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
109
+ booktitle={ICML},
110
+ year={2021}
111
+ }
112
+ ```