Dataset Preparation

Data format

The data directory is structured as follows:

.
├── data
│   ├── features
│   │   └── xxx.bin
│   ├── labels
│   │   └── xxx.meta
│   ├── knns
│   │   └── ...

features currently supports binary files only. (Support for np.save files is planned.)
labels supports plain text, where each line is the label of the corresponding feature in the feature file.
knns is optional, since it can be built with the provided functions.
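As a rough illustration of the two required inputs, the sketch below writes and reads a toy feature file and a toy label file. It assumes the binary file holds raw float32 values stored row by row with no header, and that the feature dimension is known in advance; neither detail is stated above, and `read_features`, `read_meta`, and `feature_dim` are hypothetical names, not functions from this repo.

```python
import os
import tempfile
from array import array


def read_meta(path):
    """Read a label file: one label per line, in the same order as the features."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def read_features(path, feature_dim):
    """Read a headerless float32 binary file into a list of rows (assumed layout)."""
    values = array("f")  # 4-byte floats
    with open(path, "rb") as f:
        values.frombytes(f.read())
    if len(values) % feature_dim != 0:
        raise ValueError("number of values is not a multiple of feature_dim")
    return [
        values[i:i + feature_dim].tolist()
        for i in range(0, len(values), feature_dim)
    ]


# Toy data: three 2-dimensional features and their three labels.
tmp_dir = tempfile.mkdtemp()
feat_path = os.path.join(tmp_dir, "toy.bin")
meta_path = os.path.join(tmp_dir, "toy.meta")

with open(feat_path, "wb") as f:
    array("f", [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]).tofile(f)
with open(meta_path, "w") as f:
    f.write("0\n0\n1\n")

features = read_features(feat_path, feature_dim=2)
labels = read_meta(meta_path)
print(len(features), len(labels))
```

With NumPy installed, `np.fromfile(path, dtype=np.float32).reshape(-1, feature_dim)` reads the same assumed layout in one call.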

Take MS-Celeb-1M (Part0 and Part1) as an example. The data directory is as follows:

data
  ├── features
    ├── part0_train.bin                 # acbbc780948e7bfaaee093ef9fce2ccb
    ├── part1_test.bin                  # ced42d80046d75ead82ae5c2cdfba621
  ├── labels
    ├── part0_train.meta                # class_num=8573, inst_num=576494
    ├── part1_test.meta                 # class_num=8573, inst_num=584013
  ├── knns
    ├── part0_train/faiss_k_80.npz      # 5e4f6c06daf8d29c9b940a851f28a925
    ├── part1_test/faiss_k_80.npz       # d4a7f95b09f80b0167d893f2ca0f5be5
  ├── pretrained_models
    ├── pretrained_gcn_d_ms1m.pth       # 213598e70ddbc50f5e3661a6191a8be1
    ├── pretrained_gcn_s_ms1m.pth       # 3251d6e7d4f9178f504b02d8238726f7
    ├── pretrained_gcn_d_iop_ms1m.pth   # 314fba47b5156dcc91383ad611d5bd96
    ├── pretrained_gcn_v_ms1m.pth       # 020236d4e8dbff975360f08cb47109c0
    ├── pretrained_gcn_e_ms1m.pth       # 315ff08f28f14bc494dd36158c11e900
    ├── pretrained_lgcn_ms1m.pth        # 97fc6e52d1b5e907eabeb01e7b0825f9
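The 32-character hex strings in the listing above look like MD5 digests of the files, although this is not stated explicitly. Assuming that is the case, a downloaded file could be checked with a sketch like the following; `md5_of_file` is a hypothetical helper, and the demo hashes a small temporary file rather than a real dataset file.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a small temporary file.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.bin")
with open(demo_path, "wb") as f:
    f.write(b"hello")

demo_digest = md5_of_file(demo_path)
print(demo_digest)
```

For a real file, compare the result with the string listed next to it, e.g. `md5_of_file("data/features/part0_train.bin") == "acbbc780948e7bfaaee093ef9fce2ccb"`.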

To experiment with a custom dataset, you need to provide extracted features and labels. For training, the number of features must equal the number of labels. For testing, the F-score is evaluated if labels are provided; otherwise only clustering results are generated.
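Before training on a custom dataset, it can help to confirm that the feature and label counts match. The sketch below infers the feature count from the file size, assuming headerless float32 features of a known dimension (an assumption, not something specified here); `count_features`, `count_labels`, and `check_pair` are hypothetical helpers.

```python
import os
import tempfile


def count_features(path, feature_dim, bytes_per_value=4):
    """Infer the number of feature vectors from the file size (assumed float32)."""
    size = os.path.getsize(path)
    row_bytes = feature_dim * bytes_per_value
    if size % row_bytes != 0:
        raise ValueError(f"{path}: size {size} is not a multiple of {row_bytes}")
    return size // row_bytes


def count_labels(path):
    """Count the non-empty lines of a label file."""
    with open(path) as f:
        return sum(1 for line in f if line.strip())


def check_pair(feature_path, label_path, feature_dim):
    """Return (counts_match, num_features, num_labels)."""
    n_feat = count_features(feature_path, feature_dim)
    n_lab = count_labels(label_path)
    return n_feat == n_lab, n_feat, n_lab


# Demo: three zero-valued 2-dimensional features and three labels.
tmp_dir = tempfile.mkdtemp()
feat_path = os.path.join(tmp_dir, "custom.bin")
meta_path = os.path.join(tmp_dir, "custom.meta")
with open(feat_path, "wb") as f:
    f.write(bytes(3 * 2 * 4))
with open(meta_path, "w") as f:
    f.write("7\n7\n9\n")

ok, n_feat, n_lab = check_pair(feat_path, meta_path, feature_dim=2)
print(ok, n_feat, n_lab)
```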

Note that labels are required only for training a clustering model; they are not mandatory for clustering unlabeled data. There are two ways to cluster unlabeled data without a meta file: (1) Do not pass label_path in the config file. No loss or evaluation results will be generated. (2) Make a pseudo meta file, e.g., by setting all labels to -1, and ignore the reported loss and evaluation results.
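For option (2), a pseudo meta file is just a text file with one placeholder label per feature. A minimal sketch, where `write_pseudo_meta` is a hypothetical helper and the instance count must be supplied by you:

```python
import os
import tempfile


def write_pseudo_meta(path, num_instances, label=-1):
    """Write one placeholder label per line, one line per feature."""
    with open(path, "w") as f:
        for _ in range(num_instances):
            f.write(f"{label}\n")


# Demo: a pseudo meta file for five unlabeled features.
tmp_dir = tempfile.mkdtemp()
pseudo_path = os.path.join(tmp_dir, "unlabeled.meta")
write_pseudo_meta(pseudo_path, num_instances=5)

with open(pseudo_path) as f:
    pseudo_lines = f.read().splitlines()
print(pseudo_lines)
```

Any loss or F-score computed against such a file is meaningless and should be ignored, as noted above.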

Supported datasets

The supported datasets are listed below.

MS-Celeb-1M
- Part1 (584K): GoogleDrive or BaiduYun (passwd: geq5).
- Benchmarks (5.21M): GoogleDrive.
- Precomputed KNN: GoogleDrive.
- Image Lists: GoogleDrive.
- Original Images: GoogleDrive. We re-align MS1M-ArcFace with our own face alignment model.
- Pretrained Face Recognition Model: GoogleDrive. To extract features with the model, see the code and try it on the sample data.
YouTube-Face: GoogleDrive or BaiduYun (passwd: aper).
DeepFashion: GoogleDrive or BaiduYun (passwd: 8fai).

You can download the datasets with the links above or with the script below:

python tools/download_data.py

Now, you can switch to README.md to train and test the model.
