Iterative Object Count Optimization for Text-to-image Diffusion Models

1Tel-Aviv University, 2Bar-Ilan University
Teaser

We propose a plug-and-play optimization method that improves the object-counting accuracy of text-to-image models using detection models.

Abstract

We address a persistent challenge in text-to-image models: accurately generating a specified number of objects. Current models, which learn from image-text pairs, inherently struggle with counting, as training data cannot depict every possible number of objects for any given object. To solve this, we propose optimizing the generated image based on a counting loss derived from a counting model that aggregates an object’s potential. Employing an out-of-the-box counting model is challenging for two reasons: first, the model requires a scaling hyperparameter for the potential aggregation that varies depending on the viewpoint of the objects, and second, classifier guidance techniques require modified models that operate on noisy intermediate diffusion steps. To address these challenges, we propose an iterated online training mode that improves the accuracy of inferred images while altering the text conditioning embedding and dynamically adjusting hyperparameters. Our method offers three key advantages: (i) it can be based on any detection model; (ii) it is a zero-shot plug-and-play solution facilitating rapid changes to the counting techniques and image generation methods; and (iii) the optimized counting token can be reused to generate accurate images without further optimization. We evaluate the generation of various objects and show significant improvements in accuracy.

Iterative Optimization

Generated examples: 5 oranges, 5 cupcakes, 5 cars, 5 sheep.

Method

Our method iteratively generates images and calculates a counting loss based on object potentials. To align this count with non-differentiable, detection-based counting, we scale the potentials at each step using a detection model (e.g., YOLO). We also include a CLIP matching penalty term to preserve the semantics of the image while changing the object count. The loss at each step is used to update an added counting token, which conditions the generation. After convergence, the counting token can be reused to generate additional images with better accuracy, without further optimization.
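The loop above can be sketched in code. This is a minimal stand-in, not the paper's implementation: `generate_potential_map` replaces the full diffusion model plus counting network with a small ReLU-linear map, the gradient is taken by finite differences instead of backpropagation, and the CLIP penalty is omitted. Only the overall structure (optimize a token embedding against a scaled counting loss) reflects the method.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_potential_map(token, W):
    # Stand-in for "diffusion model + counting network": maps the
    # counting-token embedding to a non-negative object-potential map.
    return np.maximum(W @ token, 0.0)

def counting_loss(token, W, target, scale):
    p = generate_potential_map(token, W)
    count = p.sum() / scale          # aggregate potentials into a count
    return (count - target) ** 2

# Hypothetical sizes: 64-dim token embedding, 256-"pixel" potential map.
dim, pixels, target, scale = 64, 256, 5.0, 10.0
W = rng.normal(scale=0.1, size=(pixels, dim))
token = rng.normal(size=dim)

init_count = generate_potential_map(token, W).sum() / scale

lr, eps = 1e-2, 1e-4
for step in range(500):
    # Finite-difference gradient on the token (the real method backprops
    # through the counting model instead).
    base = counting_loss(token, W, target, scale)
    grad = np.zeros_like(token)
    for i in range(dim):
        t = token.copy()
        t[i] += eps
        grad[i] = (counting_loss(t, W, target, scale) - base) / eps
    token -= lr * grad

final_count = generate_potential_map(token, W).sum() / scale
print(f"count: {init_count:.2f} -> {final_count:.2f} (target {target})")
```

The token embedding, not the image, is the optimization variable, which is what makes the learned token reusable afterwards.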

Optimization Process

We employ a differentiable object potential model that approximates the number of objects. Below, we show an illustration of the optimization process at different time steps for the prompts “A photo of 5 cupcakes” and “A photo of 4 cottons”, alongside their object potential maps.
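As a toy illustration of the aggregation step (assuming, for the sketch, that the potential map is collapsed by summing and dividing by a per-object scale; the function name and toy values are ours):

```python
import numpy as np

def aggregate_count(potential_map, scale):
    """Collapse a per-pixel object-potential map into a scalar count.

    `scale` is the potential mass one object contributes; it varies with
    object size and viewpoint, which is why a fixed value is hard to pick.
    """
    return potential_map.sum() / scale

# Toy map: three blobs, each carrying potential mass 4.0, on an 8x8 grid.
pmap = np.zeros((8, 8))
pmap[1, 1] = pmap[4, 5] = pmap[6, 2] = 4.0
print(aggregate_count(pmap, scale=4.0))  # → 3.0
```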

Detection-based Dynamic Scale

A scaling hyperparameter is required to aggregate and normalize the potential map into a single count. Finding a single scaling factor that works for any object is challenging: varying object sizes and viewpoints deform the potential maps. To address this, we dynamically adjust the scaling hyperparameter during generation using a detection-based counting score, which also lets the detection model steer the optimization.
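One simple way to realize such a calibration (our sketch, with hypothetical names, assuming the sum-over-scale aggregation above) is to re-estimate the scale so that the aggregated count agrees with the detector's count on the current image:

```python
import numpy as np

def calibrate_scale(potential_map, detected_count, eps=1e-6):
    """Re-estimate the aggregation scale from a (non-differentiable)
    detector's count, so sum(potential)/scale matches the detector."""
    return potential_map.sum() / max(detected_count, eps)

pmap = np.full((4, 4), 1.5)   # total potential mass = 24.0
detector_count = 6            # e.g., a YOLO model found 6 objects
scale = calibrate_scale(pmap, detector_count)
print(scale)                  # → 4.0
print(pmap.sum() / scale)     # → 6.0
```

The detector is used only to set the scalar `scale`, so the loss stays differentiable with respect to the potential map while still being anchored to detection-based counting.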

Token Reuse

The counting token can be reused. For instance, a token our method trained for the prompt “A photo of 10 oranges” was later employed for “A photo of 10 strawberries,” and similarly for pigeons and crows. This can eliminate repeated inference-time optimization of new counting tokens for the same quantities.

Further, the counting token can be reused with different prompts. For instance, a token trained on oranges still yields an accurate count when the prompt adds a dog. We also show examples of reusing trained tokens with different backgrounds, such as three cups in the sea, and in different styles, such as painting.
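Mechanically, reuse amounts to splicing the optimized embedding into the conditioning sequence of a new prompt. The sketch below is a toy stand-in (the vocabulary, `<count>` placeholder name, and `embed` helper are ours, not the paper's API):

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8

# Hypothetical per-token embedding table of a prompt encoder.
vocab = {w: rng.normal(size=dim) for w in
         ["a", "photo", "of", "<count>", "oranges", "strawberries"]}

# Suppose <count> was optimized for "a photo of <count> oranges";
# reusing it just means keeping the same embedding in a new prompt.
optimized_count_token = rng.normal(size=dim)
vocab["<count>"] = optimized_count_token

def embed(prompt):
    # Stack per-token embeddings into a conditioning sequence.
    return np.stack([vocab[w] for w in prompt.split()])

seq = embed("a photo of <count> strawberries")
print(seq.shape)  # → (5, 8); position 3 carries the optimized token
```

Because the rest of the prompt is embedded normally, the object class, background, or style can change freely while the learned count signal carries over.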

BibTeX

@misc{zafar2024iterativeobjectcountoptimization,
  title={Iterative Object Count Optimization for Text-to-image Diffusion Models},
  author={Oz Zafar and Lior Wolf and Idan Schwartz},
  year={2024},
  eprint={2408.11721},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2408.11721},
}