Object segmentation is a cornerstone problem of the computer vision field. It is used in many applications, from autonomous driving to surveillance to robotics. The goal here is to accurately identify the boundaries of objects in an image and assign a label to each pixel that indicates the object it belongs to. In the end, you get a highlight for each object in your image.
The recent advancement in deep learning made object segmentation a relatively easy problem to solve, though the challenging scenarios still remain an open issue. It is still an active area of research, and many sophisticated algorithms have been developed to tackle various problems.
One of the main problems in object segmentation models is their limited dictionaries. The majority of existing models can only segment the objects they have seen during the training. If you have an animal segmentation model trained on images of cats and dogs only, it will not segment the panda in the image.
There have been multiple attempts to tackle this “closed” vocabulary of object segmentation models. However, few works have been able to provide a unified framework that can parse all object instances and scene semantics simultaneously.
Most current approaches for open-vocabulary recognition rely on large-scale text-image discriminative models. While these pre-trained models are good at classifying individual object proposals or pixels, they are not necessarily optimal for performing scene-level structural understanding. Moreover, they often lack spatial and relational understanding, which is a bottleneck for open-vocabulary panoptic segmentation.
How can we teach them the objects they haven’t seen during the training? How can we make object segmentation models’ vocabulary an open one? Time to meet ODISE, Open-vocabulary DIffusion-based panoptic SEgmentation.
ODISE is proposed based on the observation that the text-to-image diffusion models excel at understanding the text prompts. They can recognize thousands of objects and come up with contextual understanding. So, if they can go from text to image, why not use them in reverse and go from image to text?
ODISE utilizes both large-scale diffusion models and text-image discriminative models. At a high level, it contains a pre-trained frozen text-to-image diffusion model into which an image and its caption are inputted. Then, the internal features of the diffusion model are extracted. With these features as input, the mask generator produces panoptic masks of all possible concepts in the image. The mask classification module then categorizes each mask into one of many open-vocabulary categories by associating each predicted mask’s diffusion features with text embeddings of several object category names. Once trained, ODISE performs open-vocabulary panoptic inference with both the text-image diffusion and discriminative models to classify a predicted mask.
ODISE is the first work to explore large-scale text-to-image diffusion models for open-vocabulary segmentation tasks. It proposes a novel pipeline to effectively leverage both text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation. ODISE outperforms all existing baselines on many open-vocabulary recognition tasks, significantly advancing the field forward.
Check out the Paper. Don’t forget to join our 19k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com
🚀 Check Out 100’s AI Tools in AI Tools Club
Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He is currently pursuing a Ph.D. degree at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.