---
library_name: diffusers
---

# MGIE

This repository contains the UNet and LLaVA model checkpoints from [Guiding Instruction-based Image Editing via Multimodal Large Language Models](https://arxiv.org/abs/2309.17102).

For a detailed example of usage, refer to [this notebook](https://github.com/apple/ml-mgie/blob/main/demo.ipynb) and the [official repository](https://github.com/apple/ml-mgie). Additionally, this notebook is a memory-optimized version of the original one. It decouples the MGIE inference pipeline into two broad stages, freeing memory in between:

1. Calculate all the embeddings in a batched manner with the LLaVA model and the edit head.
2. Remove the LLaVA model and the edit head from memory to free up VRAM.
3. Load the InstructPix2Pix pipeline and perform the editing.

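
The staging pattern above can be sketched generically as follows. This is a minimal illustration, not the notebook's actual code: `run_in_stages` and its four callable arguments are hypothetical placeholders that stand in for loading LLaVA and the edit head, computing embeddings, loading the InstructPix2Pix pipeline, and editing. The `torch` import is optional and only used to release cached CUDA memory if PyTorch is installed.

```python
import gc


def run_in_stages(load_stage_one, compute, load_stage_two, edit, inputs):
    """Run two model stages so that only one is resident in memory at a time.

    All four callables are hypothetical placeholders, not MGIE or diffusers APIs.
    """
    # Stage 1: load the first model(s) and compute all embeddings up front.
    stage_one = load_stage_one()
    embeddings = [compute(stage_one, item) for item in inputs]

    # Drop the only reference to stage 1 so its memory can be reclaimed.
    del stage_one
    gc.collect()
    try:
        import torch  # optional: only needed to release cached CUDA memory

        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass

    # Stage 2: load the editing pipeline only after stage 1 is gone.
    stage_two = load_stage_two()
    return [edit(stage_two, emb) for emb in embeddings]
```

Note that this only frees memory if nothing else keeps a reference to the first-stage model, so the computed embeddings should not hold on to it.
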
💡 MGIE needs additional setup steps that must be completed before running inference. Please refer to the repository for those instructions. Importantly, you need to merge the LLaVA weight deltas with the original LLaMA parameters. More details are in the repository.

## Processing ultra high-resolution images

Since the [InstructPix2Pix pipeline](https://huggingface.co/docs/diffusers/main/en/api/pipelines/pix2pix) doesn't do any internal processing to resize the input images, you might run into out-of-memory (OOM) errors when processing ultra high-resolution images like [this one](https://i.imgur.com/CiAbKbS.jpg).

So, it's recommended to resize them while preserving their aspect ratio. Here's a utility function for that:

```python
from diffusers.utils import load_image
from PIL import Image


def resize_image_aspect_ratio(img_url, base_width=None, base_height=None):
    # Load the image
    img = load_image(img_url).convert("RGB")

    # Get the current width and height of the image
    width, height = img.size

    # Calculate the new dimensions based on the aspect ratio
    if base_width is not None:
        # Calculate new height based on the base_width to maintain aspect ratio
        w_percent = base_width / float(width)
        h_size = int(float(height) * w_percent)
        new_size = (base_width, h_size)
    elif base_height is not None:
        # Calculate new width based on the base_height to maintain aspect ratio
        h_percent = base_height / float(height)
        w_size = int(float(width) * h_percent)
        new_size = (w_size, base_height)
    else:
        raise ValueError("Either base_width or base_height must be provided")

    # Resize the image (Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter)
    resized_img = img.resize(new_size, Image.LANCZOS)
    return resized_img
```

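
When the orientation of the input is not known in advance, it can help to cap the longer side instead of fixing the width or the height. The helper below is a hypothetical sketch of that arithmetic only (`fit_within` and `max_side` are illustrative names, and 1024 is an arbitrary example value, not a recommendation from the MGIE authors). It leaves images that are already small enough untouched.

```python
def fit_within(width, height, max_side):
    """Return (new_width, new_height) whose longer side is at most max_side.

    Hypothetical helper: the aspect ratio is preserved up to integer rounding.
    """
    longer = max(width, height)
    if longer <= max_side:
        # Already small enough; avoid upscaling.
        return width, height
    # Scale both sides by the same factor, max_side / longer.
    new_width = max(1, round(width * max_side / longer))
    new_height = max(1, round(height * max_side / longer))
    return new_width, new_height


print(fit_within(4000, 3000, 1024))  # -> (1024, 768)
print(fit_within(512, 256, 1024))    # -> (512, 256)
```

The resulting size could then be passed to Pillow's `Image.resize`, or the longer side could be given as `base_width` or `base_height` to a function like the one above.
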
## Citation

```
@inproceedings{fu2024mgie,
  author = {Tsu-Jui Fu and Wenze Hu and Xianzhi Du and William Yang Wang and Yinfei Yang and Zhe Gan},
  title = {{Guiding Instruction-based Image Editing via Multimodal Large Language Models}},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year = {2024}
}
```