2022 Data Mining Course Project

Problem Description

With advances in storage and communication technology, retrieval over large-scale image data has become a pressing problem. In practice, an image encoder converts each image into a high-dimensional vector, so large-scale image retrieval reduces to indexing and searching high-dimensional vectors. This project covers only the vector search (the embedding dimension is 512) and leaves out the image encoding step: students only need to retrieve, from the large-scale vector gallery, the matches for the query vectors we provide.
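As a point of reference for the retrieval task described above, here is a minimal brute-force sketch. It is not the provided search.py: the function name `top_k_cosine`, the choice of cosine similarity, and the synthetic data are all assumptions made for illustration, since the similarity measure is not stated here.

```python
import numpy as np


def top_k_cosine(query, gallery, k=10):
    """Return, for each query row, the indices of the k most similar gallery rows.

    Cosine similarity is an assumption; the project text does not name a metric.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = q @ g.T  # shape: (num_queries, num_gallery)

    # argpartition selects the k best candidates per row without a full sort.
    part = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    # Sort only those k candidates by descending score.
    order = np.argsort(-np.take_along_axis(scores, part, axis=1), axis=1)
    return np.take_along_axis(part, order, axis=1)


# Synthetic stand-in data: 1,000 gallery vectors of dimension 512.
rng = np.random.default_rng(0)
gallery = rng.standard_normal((1000, 512)).astype(np.float32)
# Each query is a slightly perturbed copy of one of the first 5 gallery vectors.
query = gallery[:5] + 0.01 * rng.standard_normal((5, 512)).astype(np.float32)

top10 = top_k_cosine(query, gallery, k=10)
print(top10.shape)
```

Exhaustive search like this scores every query against every gallery vector, so it serves mainly as an accuracy baseline against which a faster index can be compared.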


Data and File Format

├── submissions
│   ├── output.csv
├── test_b
│   ├── gallery_emb.npy
│   ├── labels_5000.pkl
│   └── query_emb.npy
├── test_a
│   ├── gallery_emb.npy
│   ├── labels_500.pkl
│   └── query_emb.npy
├── evaluation.py
├── run.sh
├── search.py
└── Readme.md

  • test_a (500 queries; gallery of 500,000) and test_b (5,000 queries; gallery of 5,000,000) share the same file structure. We first provide the smaller dataset test_a so that students can debug their search algorithm: query_emb.npy holds the query embeddings, gallery_emb.npy holds the embeddings to be searched, and labels_500.pkl holds, for each query embedding, the 10 indices in gallery_emb.npy that belong to the same group. We also provide sample search code (search.py) and evaluation code (evaluation.py).
  • Submit only your modified search.py to [email protected]. We will score the project on a combination of query time and P@10.
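The files listed above could be loaded roughly as follows. This is a sketch under assumptions: `load_split` is a hypothetical helper, and the pickle is assumed to hold one sequence of 10 gallery indices per query, which the description suggests but does not spell out. The demo writes tiny synthetic stand-in files so that it runs without the real data.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def load_split(split_dir, labels_name):
    """Load gallery embeddings, query embeddings, and labels from one split directory."""
    split_dir = Path(split_dir)
    gallery = np.load(split_dir / "gallery_emb.npy")
    query = np.load(split_dir / "query_emb.npy")
    with open(split_dir / labels_name, "rb") as f:
        labels = pickle.load(f)  # assumed layout: 10 gallery indices per query
    return gallery, query, labels


# Demo on synthetic stand-in files (not the real test_a data).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    np.save(tmp / "gallery_emb.npy", np.zeros((100, 512), dtype=np.float32))
    np.save(tmp / "query_emb.npy", np.zeros((5, 512), dtype=np.float32))
    with open(tmp / "labels_500.pkl", "wb") as f:
        pickle.dump(np.arange(50).reshape(5, 10), f)

    gallery, query, labels = load_split(tmp, "labels_500.pkl")

print(gallery.shape, query.shape, np.asarray(labels).shape)
```

With the real data the call would look like `load_split("test_a", "labels_500.pkl")`. If the embeddings are stored as float32 (an assumption), the test_b gallery of 5,000,000 × 512 values would occupy about 10 GB, in which case passing `mmap_mode="r"` to `np.load` is one way to avoid reading the whole file into memory at once.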

test_a data download link: https://pan.baidu.com/s/1jKLpwpE1vVodaDTsq2WL7A?pwd=tgi7 (extraction code: tgi7)

Evaluation Metrics

Efficiency: We measure the average time per query; the lower, the better.
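One way to estimate this number locally is to time a whole batch of queries and divide by the batch size. The sketch below assumes exactly that; how evaluation.py measures time is not shown here, and both `average_query_time` and `nearest_by_dot` are hypothetical names used only for illustration.

```python
import time

import numpy as np


def average_query_time(search_fn, query, gallery):
    """Run search_fn once over all queries and return (seconds per query, result)."""
    start = time.perf_counter()
    result = search_fn(query, gallery)
    elapsed = time.perf_counter() - start
    return elapsed / len(query), result


def nearest_by_dot(query, gallery):
    """Toy search: index of the gallery vector with the largest dot product."""
    return np.argmax(query @ gallery.T, axis=1)


# Synthetic stand-in data; the queries are exact copies of gallery rows 0..19.
rng = np.random.default_rng(0)
gallery = rng.standard_normal((2000, 512)).astype(np.float32)
query = gallery[:20].copy()

avg_seconds, nearest = average_query_time(nearest_by_dot, query, gallery)
print(f"average time per query: {avg_seconds * 1e3:.4f} ms")
```

A single run can be noisy, so repeating the measurement a few times and comparing the results gives a more stable picture.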

Effectiveness: For the top-10 results that the algorithm returns for each query, we compute the precision (P@10). The higher the precision, the better.
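P@10 for one query is the fraction of its 10 returned indices that appear in that query's ground-truth group; the overall score is taken here as the mean over queries. The sketch below assumes that averaging and a simple list-of-lists layout. `precision_at_k` is a hypothetical name, not necessarily what evaluation.py uses.

```python
import numpy as np


def precision_at_k(predicted, ground_truth, k=10):
    """Mean over queries of |top-k predictions ∩ ground truth| / k."""
    per_query = [
        len(set(pred[:k]) & set(truth)) / k
        for pred, truth in zip(predicted, ground_truth)
    ]
    return float(np.mean(per_query))


# Two toy queries: the first is fully correct, the second gets 5 of 10 right.
predicted = [
    list(range(0, 10)),
    list(range(10, 15)) + list(range(100, 105)),
]
ground_truth = [
    list(range(0, 10)),
    list(range(10, 20)),
]

score = precision_at_k(predicted, ground_truth)
print(score)  # (1.0 + 0.5) / 2 = 0.75
```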


Final Rank: We will rank submissions by the efficiency and effectiveness of the submitted search algorithm.