Vision-Language

Diffusion Instruction Tuning

Lavender, a supervised fine-tuning (SFT) method that aligns VLM text-vision attention with that of Stable Diffusion, improving Llama-3.2-11B and MiniCPM-v2.5 by up to 30% across 20 tasks.

Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Teare

Segment Anyword

Training-free prompt learning for language-grounded segmentation using token-level cross-attention from a frozen diffusion model to generate object masks.

Zhihua Liu, Amrutha Saseendran, Lei Tong, Xilin He, Fariba Yousefi, Nikolay Burlutskiy, Dino Oglic, Tom Diethe, Philip Teare, Huiyu Zhou, Chen Jin
