
Questions about Dataset class Interface and dataset prepare #32

Open · gony-noreply opened this issue Sep 24, 2021 · 4 comments

@gony-noreply (Contributor)
Question about the Dataset class interface

Is it okay to use all of the methods (interfaces) exposed by the Dataset class when implementing an algorithm for the benchmark?

I am trying to access the file directly via the get_dataset_fn method instead of the get_dataset_iterator method, and I wonder whether that is a problem.
If it is allowed, then there seems to be something wrong with the implementation of get_dataset_fn for small datasets.

In get_dataset_fn, if the original (1-billion-vector) file exists, its path is returned. That is reasonable inside get_dataset_iterator, because the iterator mmaps the file and reads only the part it needs. But if get_dataset_fn is meant to be an externally exposed interface, it should return the path of the actual small file. Otherwise, when get_dataset_fn is called for a small dataset that has no crop file, am I expected to read only part of the returned file myself?
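
For reference, this is the kind of caller-side workaround I have in mind (a minimal sketch assuming the competition's .bin layout of two int32 header fields, point count and dimension, followed by row-major vectors; first_nb_vectors is my own helper, not part of the Dataset class):

```python
import numpy as np

def first_nb_vectors(fn, nb, dtype=np.float32):
    # Read the 8-byte header: number of points, then dimension.
    with open(fn, 'rb') as f:
        n, d = np.fromfile(f, dtype=np.int32, count=2)
    # Memory-map the payload and slice; only the first nb rows are
    # ever paged in, even if fn points at the full 1B file.
    x = np.memmap(fn, dtype=dtype, mode='r', offset=8, shape=(int(n), int(d)))
    return x[:nb]
```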


Question about dataset prepare

```python
dataset.prepare(True)  # prepare dataset, but skip potentially huge base vectors
```

Can I assume that the dataset file has already been downloaded in the actual evaluation?

@gony-noreply (Contributor, Author)

My original second question was about the build phase of the benchmark process: whether I should download the data myself before building (in the benchmark code, skip_data is set to True).

But someone might also wonder whether the dataset vectors can be used at search time.

@harsha-simhadri (Owner)

In evaluation, the dataset is not available. For T2, the index can store a copy of the data (or a compressed version of it) as part of the 1TB limit on index size.
During index build the dataset is available, and it counts toward the total storage limit.
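
To illustrate (a minimal sketch, not competition code): one way to keep a compressed copy of the base vectors alongside the index is crude per-dimension scalar quantization; a real entry would use PQ or a similar scheme.

```python
import numpy as np

def save_compressed_copy(xb, index_dir):
    # Per-dimension scalar quantization to uint8 as a stand-in for a
    # real compression scheme (PQ, OPQ, ...). Store the decode params.
    lo, hi = xb.min(axis=0), xb.max(axis=0)
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
    codes = np.round((xb - lo) / scale).astype(np.uint8)
    np.savez(f"{index_dir}/base_copy.npz", codes=codes, lo=lo, scale=scale)

def load_decoded_copy(index_dir):
    # Approximate reconstruction of the base vectors at search time.
    z = np.load(f"{index_dir}/base_copy.npz")
    return z['codes'].astype(np.float32) * z['scale'] + z['lo']
```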

@maumueller (Collaborator)

When I look at get_dataset_fn, it seems to me that it returns the actual file when you are working with a cropped version.

https://github.com/harsha-simhadri/big-ann-benchmarks/blob/main/benchmark/datasets.py#L268-L276. This should be safe to use. Maybe I'm misunderstanding your question, @gony0?

In general, you have to take care of downloading the base vectors yourself by explicitly running python create_dataset.py --dataset .... However, it would be easy to add an argument to run.py so that it takes care of this; roughly what I have in mind is sketched below. I will happily do that if this turns out to be a common use case.
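
Something like this (just a sketch, assuming the DATASETS registry and the prepare(skip_data=...) signature in benchmark/datasets.py):

```python
import argparse
from benchmark.datasets import DATASETS  # name -> constructor registry

parser = argparse.ArgumentParser()
parser.add_argument('--dataset', required=True, choices=DATASETS.keys())
parser.add_argument('--download', action='store_true',
                    help='also fetch the potentially huge base vectors')
args = parser.parse_args()

ds = DATASETS[args.dataset]()
# With --download we fetch everything; otherwise keep today's behaviour
# of skipping the base vectors (the dataset.prepare(True) quoted above).
ds.prepare(skip_data=not args.download)
```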

@gony-noreply (Contributor, Author)

@maumueller Sorry for the late reply.

> it seems to me that it returns the actual file when you are working with a cropped version.

If both the 1B-sized file and the crop file exist, the 1B file path is always returned. See the code below.

```python
fn = os.path.join(self.basedir, self.ds_fn)
if os.path.exists(fn):   # the 1B file wins whenever it exists
    return fn
if self.nb != 10**9:     # the crop-file name is only built past this check
    ...
```

ds_fn holds the 1B-sized file name, and the crop-file name is constructed only when the 1B file does not exist.

I hit this in my development environment (not the actual competition evaluation environment), where both the 1B dataset and the small dataset were present.
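
For illustration, the ordering I expected would be something like the following (a sketch only; the crop-file suffix is my guess at the repo's naming, so match it to whatever datasets.py actually uses):

```python
import os

def get_dataset_fn(self):
    fn = os.path.join(self.basedir, self.ds_fn)
    if self.nb != 10**9:
        # Prefer the cropped file so a local 1B file cannot shadow it.
        crop_fn = fn + '.crop_nb_%d' % self.nb  # illustrative suffix
        if os.path.exists(crop_fn):
            return crop_fn
    if os.path.exists(fn):
        return fn
    raise RuntimeError('file %s not found' % fn)
```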
