I’ve been diving into CoLLM, a new approach that solves composed image retrieval (finding images that match “this image but with these modifications”) without requiring manual training data. The key innovation is using LLMs to generate training triplets on-the-fly from standard image-caption pairs, eliminating the expensive manual annotation process.
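To make the triplet-generation idea concrete, here's a rough sketch of what converting an image-caption pair into a CIR training triplet could look like. The prompt wording, the `llm` callable, and the JSON output format are my own assumptions for illustration, not the paper's actual pipeline.

```python
# Hypothetical sketch of on-the-fly triplet generation from image-caption pairs.
# Assumes an `llm(prompt: str) -> str` callable (any chat-completion client would do).
import json

PROMPT_TEMPLATE = """Given an image caption, invent a plausible modification request
and the caption of the resulting target image.
Caption: "{caption}"
Respond as JSON: {{"modification": "...", "target_caption": "..."}}"""

def make_triplet(image, caption, llm):
    """Turn one image-caption pair into a composed-image-retrieval triplet:
    (reference image, modification text, pseudo target description)."""
    raw = llm(PROMPT_TEMPLATE.format(caption=caption))
    parsed = json.loads(raw)
    return {
        "reference_image": image,                    # the reference stays the original image
        "modification_text": parsed["modification"],
        "target_caption": parsed["target_caption"],  # stands in for a target image
    }
```

Because the triplets are generated during training rather than annotated up front, any image-caption corpus can in principle serve as CIR training data.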
The technical approach has several interesting components:

* Creates joint embeddings that process reference images and modification texts together
* Uses LLMs to understand how textual modifications apply to visual content
* Generates diverse and realistic modification texts through LLM prompting
* Implements efficient retrieval through contrastive learning techniques (a minimal sketch follows this list)
* Introduces a new 3.4M-sample dataset (MTCIR) for better evaluation
* Refines existing benchmarks to address annotation inconsistencies
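For the joint-embedding and contrastive-retrieval pieces, a minimal PyTorch-style sketch might look like the following. The concat-plus-MLP fusion and module names are assumptions made for illustration, not CoLLM's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedder(nn.Module):
    """Fuses a reference-image embedding with a modification-text embedding
    into one query vector (simple concat + MLP fusion; CoLLM's fusion may differ)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        query = self.fuse(torch.cat([img_emb, txt_emb], dim=-1))
        return F.normalize(query, dim=-1)

def contrastive_loss(query: torch.Tensor, target: torch.Tensor, temperature: float = 0.07):
    """Standard InfoNCE: each query should retrieve its own target within the batch."""
    logits = query @ target.t() / temperature
    labels = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, labels)
```

At retrieval time the same fused query vector can be compared against precomputed gallery embeddings with a dot product, which is what makes the contrastive setup efficient.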
The results are quite strong:

* Achieves state-of-the-art performance across multiple CIR benchmarks
* Improves performance by up to 15% compared to previous methods
* Demonstrates effectiveness in both zero-shot and fine-tuned settings
* Synthetic triplet generation outperforms previous zero-shot approaches
I think this approach could be transformative for multimodal AI systems beyond just image search. The ability to effectively combine visual and textual understanding without expensive manual data collection addresses a fundamental bottleneck in developing these systems.
The on-the-fly triplet generation technique could be applied to other vision-language tasks where paired data is scarce. It also suggests a more scalable path to building systems that understand natural language modifications to visual content.
That said, there are computational costs to consider – running LLMs for triplet generation adds overhead that might be challenging for real-time or large-scale training pipelines. And as with any LLM-based approach, the quality of the generated triplets depends on the underlying models.
TLDR: CoLLM uses LLMs to generate training data on-the-fly for composed image retrieval, achieving SOTA results without needing expensive manual annotations. It creates joint embeddings of reference images and modification texts and introduces a new 3.4M sample dataset.
Full summary is here. Paper here.
submitted by /u/Successful-Western27