Stage 7: Repeat Computation for Concept Balancing

Balance the dataset by computing a repeat/multiply for each sub-folder

  • If you start from this stage, please set --src_dir to the training folder containing all the training images, whether screenshots, fan art, or regularization images (/path/to/dataset_dir/training by default).
  • In-place operation.

I assume here that we have multiple types of images. They should all be put in the training folder for this stage to be effective.

Command line arguments

  • compute_multiply_up_levels: This argument specifies the number of directory levels to ascend from the rearranged directory when setting the source directory for the compute multiply stage. The default value is 1.
    Example usage: --compute_multiply_up_levels 0
  • weight_csv: This parameter allows the use of a specified CSV file to modify weights during the compute multiply stage.
    Example usage: --weight_csv path/to/custom_weighting.csv
  • min_multiply and max_multiply: These two parameters set, respectively, the minimum and the maximum multiply for each image folder. The default values are 1 and 100.
    Example usage: --min_multiply 0.5 --max_multiply 150

Technical details

Here we generate multiply.txt in each image folder to indicate how many times the images of this folder should be repeated, with the goal of balancing between different concepts during training.

To compute the multiply of each image folder, we first compute its sampling probability. We do this by going through the hierarchy: at each node, we sample each child with probability proportional to its weight. Each weight defaults to 1 but can be changed with the CSV file provided through --weight_csv (default_weighting.csv is used by default). For each child directory, the CSV is first searched for the folder name, and then for a pattern matching the entire path (the path of src_dir plus the path from src_dir to the directory), as understood by fnmatch.
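
For illustration, here is a minimal sketch of this lookup order in Python. The helper names (load_weight_mapping, get_folder_weight) and the exact matching details are assumptions made for the example, not the pipeline's actual code.

```python
import csv
import fnmatch
import os


def load_weight_mapping(weight_csv):
    """Read 'name_or_pattern, weight' rows into a dict."""
    mapping = {}
    with open(weight_csv, newline="") as f:
        for row in csv.reader(f):
            if len(row) >= 2:
                mapping[row[0].strip()] = float(row[1])
    return mapping


def get_folder_weight(dir_path, mapping, default=1.0):
    """Look up the weight of a child directory.

    First try the bare folder name, then try each CSV entry as an
    fnmatch pattern against the full path; fall back to the default.
    """
    name = os.path.basename(dir_path)
    if name in mapping:
        return mapping[name]
    for pattern, weight in mapping.items():
        if fnmatch.fnmatch(dir_path, pattern):
            return weight
    return default
```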

For example, consider the folder structure

├── ./1_character
│   ├── ./1_character/class1
│   └── ./1_character/class2
└── ./others
    ├── ./others/class1
    └── ./others/class3

and the csv

1_character, 3
class1, 4
*class2, 6

For simplicity (and this is good practice), assume images are only in the class folders. Then the sampling probabilities of ./1_character/class1, ./1_character/class2, ./others/class1, and ./others/class3 are respectively 0.75 * 0.4 = 0.3, 0.75 * 0.6 = 0.45, 0.25 * 0.8 = 0.2, and 0.25 * 0.2 = 0.05. Note that the same weight of class1 can yield different sampling probabilities because the other folders at the same level can have different weights (in this case ./1_character/class2 has weight 6 while ./others/class3 has weight 1).
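
A hedged sketch of how such probabilities can be obtained by walking the hierarchy (again hypothetical code; it assumes the get_folder_weight helper from the sketch above and that the leaf directories are the image folders):

```python
import os


def compute_sampling_probs(src_dir, mapping):
    """Assign each leaf image folder a sampling probability by
    multiplying the normalized weights along its path."""
    probs = {}

    def recurse(current_dir, prob):
        children = [
            os.path.join(current_dir, d)
            for d in sorted(os.listdir(current_dir))
            if os.path.isdir(os.path.join(current_dir, d))
        ]
        if not children:  # leaf image folder
            probs[current_dir] = prob
            return
        weights = [get_folder_weight(c, mapping) for c in children]
        total = sum(weights)
        for child, weight in zip(children, weights):
            recurse(child, prob * weight / total)

    recurse(src_dir, 1.0)
    return probs


# With the example hierarchy and CSV above, this yields roughly
# {'./1_character/class1': 0.3, './1_character/class2': 0.45,
#  './others/class1': 0.2, './others/class3': 0.05}
```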

Now that we have the sampling probability of each image folder, we can compute the weight per image by dividing it by the number of images in that folder. Finally, we convert it into a multiply by rescaling so that the minimum multiply equals --min_multiply (default 1). The argument --max_multiply sets a hard limit on the multiply of each image folder; any value above it is clipped. After running the command you can check the log file to see whether you are satisfied with the generated multiplies/repeats.
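
Continuing the same hypothetical sketch, the conversion from sampling probability to multiply could look like this (the actual script may differ in details such as how empty folders are handled):

```python
def probs_to_multiplies(probs, n_images, min_multiply=1.0, max_multiply=100.0):
    """Convert folder sampling probabilities to per-folder multiplies.

    probs: {folder: sampling probability}
    n_images: {folder: number of images in that folder}
    """
    # Per-image weight: how often a single image of the folder is sampled.
    per_image = {f: p / max(n_images[f], 1) for f, p in probs.items()}
    # Rescale so the smallest per-image weight becomes min_multiply,
    # then clip everything above max_multiply.
    scale = min_multiply / min(per_image.values())
    return {f: min(w * scale, max_multiply) for f, w in per_image.items()}
```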

⚠️ The generated multiplies take float values. However, most trainers do not support float repeats, so these values may need to be rounded to integers before launching the training process. This is done in both flatten_folder.py and prepare_hcp.py.
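
For instance, a naive rounding pass (purely illustrative; whether the two scripts round exactly this way is not specified here) could be:

```python
def round_multiply(multiply):
    """Round a float multiply to an integer repeat, keeping at least 1."""
    return max(1, round(multiply))
```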