Text-to-image generation models have recently been revolutionizing Artificial Intelligence (AI) and the way creative image synthesis is performed. They rely on powerful language models that split the input prompt into tokens and encode them into numerical representations, called embeddings, which capture the essential semantic information contained in the given text.
Vision-language models like CLIP combine these embeddings with a contrastive learning objective for cross-modal retrieval tasks, which involve finding semantically relevant matches between text and images. CLIP exploits vast datasets of image-text pairs to learn the relationships between images and their captions. Well-established diffusion models, such as Stable Diffusion, DALL-E, or Midjourney, use CLIP to bring semantic awareness into the diffusion process, in which noise is progressively added to an image and then iteratively removed to recover a clean output that matches the prompt.
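As a concrete illustration, the following minimal sketch scores how well a set of captions matches an image using the Hugging Face transformers implementation of CLIP; the image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
captions = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarities between the image
# embedding and each caption embedding; softmax turns them into match scores.
print(outputs.logits_per_image.softmax(dim=-1))
```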
From these complex models, simpler but still powerful solutions can be derived through Score Distillation Sampling (SDS). In SDS, a large pre-trained diffusion model acts as a guide: its score (noise) predictions on noised versions of the current image provide gradients that steer the optimized image, or a smaller generator producing it, toward the text prompt.
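To make this concrete, here is a minimal sketch of one SDS update in PyTorch. The `diffusion_model` call signature and the `scheduler.coeffs`/`scheduler.weight` helpers are assumptions made for illustration, not a specific library's API.

```python
import torch

def sds_grad(x, text_emb, diffusion_model, scheduler, t):
    # Sample ground-truth noise and build a noised version of the image x.
    eps = torch.randn_like(x)
    alpha_t, sigma_t = scheduler.coeffs(t)  # assumed noise-schedule helper
    z_t = alpha_t * x + sigma_t * eps
    # Query the frozen pre-trained model; no gradients flow through it.
    with torch.no_grad():
        eps_pred = diffusion_model(z_t, t, text_emb)
    # The residual between predicted and true noise is the update direction.
    return scheduler.weight(t) * (eps_pred - eps)  # w(t) weighting, also assumed

# Usage: apply the returned tensor as the gradient on the image.
# x = x - lr * sds_grad(x, text_emb, diffusion_model, scheduler, t)
```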
Although very powerful and effective at distilling complex diffusion models, SDS suffers from synthesis artifacts. One of its principal issues is mode collapse, the tendency to converge towards a few specific modes of the learned distribution. This often produces blurry outputs that capture only the elements explicitly described in the prompt, as in Figure 2.
To address this, a new distillation technique, termed Delta Denoising Score (DDS), has been proposed. The technique's name comes from the way the distillation score is computed. Unlike SDS, which queries the generative model with a single image-text pair, DDS adds a query with a reference pair, in which the text matches the image's content. The score is the difference, or delta, between the results of the two queries.
The basic form of DDS requires two image-text pairs: one is the reference, which does not change during the optimization, and the other is the optimization target, whose image should come to match the target text prompt. DDS yields effective gradients that act on the areas of the image to be edited while leaving the others untouched.
In DDS, the source image and its text caption help estimate the undesirable, noisy gradient directions introduced by SDS. When performing fine-grained or partial edits of an image with a new text description, the reference query therefore yields a cleaner gradient direction to update the image.
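Under the same assumed interfaces as the SDS sketch above, one DDS update could look like the following; both queries share the same noise and timestep, so the noisy component present in both predictions cancels in the difference.

```python
import torch

def dds_grad(x, y_tgt, x_ref, y_ref, diffusion_model, scheduler, t):
    # One noise sample and timestep shared by both branches.
    eps = torch.randn_like(x)
    alpha_t, sigma_t = scheduler.coeffs(t)
    z_t = alpha_t * x + sigma_t * eps          # target branch (being optimized)
    z_ref_t = alpha_t * x_ref + sigma_t * eps  # fixed reference branch
    with torch.no_grad():
        delta = (diffusion_model(z_t, t, y_tgt)
                 - diffusion_model(z_ref_t, t, y_ref))
    # The noisy directions common to both queries cancel,
    # leaving only the direction of the requested edit.
    return delta
```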
Moreover, DDS can modify images purely by changing their textual descriptions, without requiring a visual mask to be computed or provided. It also allows training an image-to-image model without paired training data, which results in a zero-shot image-to-image translation method. According to the authors, this zero-shot training technique can be used for single- and multi-task image translation, and the source distribution can include both authentic and synthetically generated images.
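As a hedged sketch of how such a zero-shot setup might be wired, the step below reuses the hypothetical `dds_grad` helper from above; the generator, optimizer, and timestep sampler are likewise illustrative names, not the authors' code.

```python
def train_step(generator, optimizer, x_src, y_src, y_tgt,
               diffusion_model, scheduler):
    x_out = generator(x_src)         # candidate edited image
    t = scheduler.sample_timestep()  # assumed helper
    grad = dds_grad(x_out, y_tgt, x_src, y_src, diffusion_model, scheduler, t)
    optimizer.zero_grad()
    # Seed backpropagation through the generator with the DDS direction.
    x_out.backward(gradient=grad)
    optimizer.step()
```

No paired (source, edited) images are needed: the DDS delta alone tells the generator how its output should change to match the target caption.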
The image below compares the performance of DDS with state-of-the-art approaches for image-to-image translation.

This was a summary of Delta Denoising Score, a novel AI technique that provides faithful, clean, and detailed image-to-image and text-to-image synthesis. If you are interested, you can learn more about this technique in the links below.
Check out the Paper and Project Page.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.