KoCLIP
KoCLIP is a Korean port of OpenAI's CLIP.
Models
We trained two models, koclip-base and koclip-large. Both use RoBERTa-large (klue/roberta-large) as the language model. This decision was motivated by the intuition that annotated Korean datasets are rare, so a well-trained, performant LM would be key to producing a performant multimodal pipeline given limited data.
KoCLIP | LM | ViT |
---|---|---|
koclip-base | klue/roberta-large | openai/clip-vit-base-patch32 |
koclip-large | klue/roberta-large | google/vit-large-patch16-224 |
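As a rough illustration of how the language model and ViT in each row can be tied together, the sketch below projects text and image features into a shared space and scores them with the symmetric contrastive loss described in the CLIP paper. This README does not spell out the training objective, so the loss itself, the projection matrices w_text and w_image, the temperature value, and the toy feature sizes are all assumptions made for illustration, not the project's training code.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(text_feats, image_feats, w_text, w_image, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (text, image) pairs.

    Row i of text_feats is assumed to describe row i of image_feats.
    """
    # Project each modality into the shared space (hypothetical linear heads).
    t = np.asarray(text_feats, dtype=np.float64) @ w_text
    v = np.asarray(image_feats, dtype=np.float64) @ w_image
    # Unit-normalize so that dot products are cosine similarities.
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature
    idx = np.arange(logits.shape[0])
    # Cross-entropy in both directions; the correct match sits on the diagonal.
    loss_text_to_image = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_image_to_text = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_text_to_image + loss_image_to_text) / 2.0

# Toy batch: random features stand in for the LM and ViT outputs.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(4, 8))    # 4 captions, 8-d text features
image_feats = rng.normal(size=(4, 6))   # 4 images, 6-d image features
w_text = rng.normal(size=(8, 5))        # project text into a 5-d shared space
w_image = rng.normal(size=(6, 5))       # project images into the same space
loss = clip_style_loss(text_feats, image_feats, w_text, w_image)
```

The loss is small when each caption is most similar to its own image and large when the pairing is scrambled, which is the property a dual-encoder model of this kind relies on.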
Data
KoCLIP was fine-tuned using 82,783 images from the MSCOCO 2014 image captioning dataset. Korean translations of image captions were obtained from AI Hub, an open database maintained by subsidiaries of the Korean Ministry of Science and ICT. Validation metrics were monitored using approximately 40,000 images from the validation set of the same dataset.
While we also considered alternative multilingual image captioning datasets, notably the Wikipedia-based Image Text Dataset (WiT), we found non-trivial discrepancies in the way captions were curated in WiT and MSCOCO, and eventually decided to train the model on the relatively cleaner captions of MSCOCO instead of introducing more noise.
Demo
We present three demos, each of which illustrates a different use case of KoCLIP.
- Image to Text: This is essentially a zero-shot image classification task. Given an input image, the model finds the most likely caption among the text labels provided.
- Text to Image: This is essentially an image retrieval task. Given a text query, the model looks up a database of pre-computed image embeddings to retrieve the image that best matches the query.
- Text to Patch: This is also a variant of zero-shot image classification. Given a text and an image, the image is partitioned into subsections, and the model ranks them based on their relevance to the text query.
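All three demos reduce to ranking candidates by their similarity to a query in the shared embedding space. The sketch below shows that ranking step with NumPy, plus a simple grid split for the patch demo. It is a minimal illustration under stated assumptions: the function names (l2_normalize, rank_by_similarity, split_into_patches) are hypothetical, cosine similarity is assumed as the scoring function, the grid split is only one guess at how an image might be partitioned, and the toy vectors stand in for real KoCLIP encoder outputs.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=np.float64)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def rank_by_similarity(query_emb, candidate_embs):
    """Return candidate indices sorted from most to least similar, plus raw scores."""
    q = l2_normalize(query_emb)          # shape (d,)
    c = l2_normalize(candidate_embs)     # shape (n, d)
    scores = c @ q                       # cosine similarity per candidate
    order = np.argsort(-scores, kind="stable")
    return order, scores

def split_into_patches(image, rows, cols):
    """Partition an (H, W, C) array into rows * cols patches in row-major order.

    Pixels left over when H or W is not divisible are dropped (an assumption).
    """
    h, w = image.shape[:2]
    ph, pw = h // rows, w // cols
    return [image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
            for r in range(rows) for c in range(cols)]

# Image to Text: one image embedding scored against several label embeddings.
image_emb = np.array([0.9, 0.1, 0.0])
label_embs = np.eye(3)                   # stand-ins for three encoded captions
order, scores = rank_by_similarity(image_emb, label_embs)
best_label = int(order[0])               # index of the most likely caption

# Text to Image: the same call, with a text query against stored image embeddings.
text_emb = np.array([0.0, 0.2, 0.8])
image_db = np.eye(3)                     # stand-in for pre-computed image embeddings
retrieved, _ = rank_by_similarity(text_emb, image_db)

# Text to Patch: split the image, embed each patch (not shown), then rank as above.
patches = split_into_patches(np.zeros((224, 224, 3)), rows=2, cols=2)
```

Normalizing both sides first keeps the ranking independent of embedding magnitude, so the same helper serves classification, retrieval, and patch ranking.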
Findings
In this section, we detail some interesting findings we made throughout the project.
Prompting
We found that KoCLIP performs better when prompting is used to induce zero-shot behavior. Namely, instead of feeding it a single word or short phrase, casting the query into a template such as
이것은 {{}} 이다 (This is {{}}.)
noticeably helped the model. We hypothesize that this is due to the nature of captions in the MSCOCO dataset, which are most often full sentences, albeit sometimes short in length.
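The templating step can be sketched in a few lines of Python. This is an illustrative helper rather than the demo's actual code: the names PROMPT_TEMPLATE and build_prompts are hypothetical, and the doubled braces of the template are written here as a single {} placeholder for str.format.

```python
# Hypothetical prompt template: "This is {}." in Korean.
PROMPT_TEMPLATE = "이것은 {} 이다"

def build_prompts(labels, template=PROMPT_TEMPLATE):
    """Wrap each bare label in a full-sentence template before text encoding."""
    return [template.format(label) for label in labels]

# Bare labels ("dog", "cat") become sentence-like queries.
prompts = build_prompts(["개", "고양이"])
```

The resulting strings, rather than the bare labels, would then be passed to the text encoder, so that queries look more like the full-sentence captions seen during training.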
Multilinguality
Although KoCLIP was trained exclusively on a Korean dataset, we found that English queries also work surprisingly well for simple words (e.g. "dog"). This could be due to one of two reasons, or a combination thereof:
- ViT Pretraining: The ViT backbone for koclip-base, openai/clip-vit-base-patch32, was already pretrained on an English image captioning dataset. Hence, it is possible that its embeddings still lie in a latent space where vector arithmetic can be performed with English text embeddings. One reason against this hypothesis is the fact that koclip-large, whose ViT backbone (google/vit-large-patch16-224) is not a CLIP model, also demonstrates limited multilingual behavior.
- LM Knowledge Bleed: klue/roberta-large was trained on a large corpus of Korean text in a self-supervised fashion. One might reasonably suspect that English words were included in parts of the corpus, especially given the high frequency of English word transliterations in contemporary conversational Korean. This might also explain why English queries work for both koclip-base and koclip-large. One reason against this hypothesis is that the authors of KLUE explicitly state in their paper that one criterion for text selection was that "the corpus must be written in contemporary Korean."
Future Work
Due to time and resource constraints, we have yet to compare KoCLIP to other open-source baselines, such as M-CLIP. We hope to benchmark KoCLIP on various metrics and evaluation datasets to further determine its performance and reliability. In addition, given that prompt engineering is somewhat of a mystery and an active area of ongoing research, we hope to explore more scientific approaches to the topic.
References
@misc{park2021klue,
title={KLUE: Korean Language Understanding Evaluation},
author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
year={2021},
eprint={2105.09680},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{radford2021learning,
title={Learning Transferable Visual Models From Natural Language Supervision},
author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
year={2021},
eprint={2103.00020},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
We thank the teams at Hugging Face and Google for arranging this wonderful opportunity. It has been a busy yet enormously rewarding week for all of us. Hope you enjoy the demo!