
Questions about Dataset class Interface and dataset prepare #32

Open · gony-noreply opened this issue Sep 24, 2021 · 4 comments

@gony-noreply (Contributor)
Question about the Dataset class interface

Is it okay to use all of the methods (interfaces) exposed by the Dataset class when implementing an algorithm for the benchmark?

I am trying to access the file directly via the get_dataset_fn method instead of the get_dataset_iterator method, and I wonder whether that is a problem.
If it is allowed, then there seems to be something wrong with the implementation of get_dataset_fn for small datasets.

In get_dataset_fn, if the original (1-billion-vector) file exists, its path is returned. That is reasonable inside get_dataset_iterator, because the iterator mmaps the file and reads only the part it needs. But if get_dataset_fn is meant to be an externally exposed interface, it should return the path of the actual small file. Otherwise, when get_dataset_fn is called for a small dataset that has no crop file, am I expected to read only part of the returned file myself?
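
For reference, this is the kind of caller-side workaround I have in mind (a minimal sketch assuming the competition's .bin layout of two int32 header fields, point count and dimension, followed by row-major vectors; first_nb_vectors is my own helper, not part of the Dataset class):

```python
import numpy as np

def first_nb_vectors(fn, nb, dtype=np.float32):
    # Read the 8-byte header: number of points, then dimension.
    with open(fn, 'rb') as f:
        n, d = np.fromfile(f, dtype=np.int32, count=2)
    # Memory-map the payload and slice; only the first nb rows are
    # ever paged in, even if fn points at the full 1B file.
    x = np.memmap(fn, dtype=dtype, mode='r', offset=8, shape=(int(n), int(d)))
    return x[:nb]
```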


Question about dataset prepare

```python
dataset.prepare(True)  # prepare dataset, but skip potentially huge base vectors
```

Can I assume that the dataset file has already been downloaded in the actual evaluation?

@gony-noreply (Contributor, Author)

My original second question was about the build phase of the benchmark process: whether I should download the data myself before building (in the benchmark code, skip_data is set to True).

But someone might also wonder whether the dataset vectors can be used at search time.

@harsha-simhadri (Owner)

In evaluation, the dataset is not available. For T2, the index can store a copy of the data (or a compressed version of it) as part of the 1TB limit on index size.
During index build the dataset is available, and it counts toward the total storage limit.
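
To illustrate (a minimal sketch, not competition code): one way to keep a compressed copy of the base vectors alongside the index is crude per-dimension scalar quantization; a real entry would use PQ or a similar scheme.

```python
import numpy as np

def save_compressed_copy(xb, index_dir):
    # Per-dimension scalar quantization to uint8 as a stand-in for a
    # real compression scheme (PQ, OPQ, ...). Store the decode params.
    lo, hi = xb.min(axis=0), xb.max(axis=0)
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
    codes = np.round((xb - lo) / scale).astype(np.uint8)
    np.savez(f"{index_dir}/base_copy.npz", codes=codes, lo=lo, scale=scale)

def load_decoded_copy(index_dir):
    # Approximate reconstruction of the base vectors at search time.
    z = np.load(f"{index_dir}/base_copy.npz")
    return z['codes'].astype(np.float32) * z['scale'] + z['lo']
```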

@maumueller (Collaborator)

When I look at get_dataset_fn, it seems to me that it returns the actual file when you are working with a cropped version.

https://github.com/harsha-simhadri/big-ann-benchmarks/blob/main/benchmark/datasets.py#L268-L276. This should be safe to use. Maybe I'm misunderstanding your question, @gony0?

In general, you have to take care of downloading the base vectors yourself by explicitly running python create_dataset.py --dataset .... However, it would be easy to add an argument to run.py so that it takes care of this; roughly what I have in mind is sketched below. I will happily do that if this turns out to be a common use case.
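
Something like this (just a sketch, assuming the DATASETS registry and the prepare(skip_data=...) signature in benchmark/datasets.py):

```python
import argparse
from benchmark.datasets import DATASETS  # name -> constructor registry

parser = argparse.ArgumentParser()
parser.add_argument('--dataset', required=True, choices=DATASETS.keys())
parser.add_argument('--download', action='store_true',
                    help='also fetch the potentially huge base vectors')
args = parser.parse_args()

ds = DATASETS[args.dataset]()
# With --download we fetch everything; otherwise keep today's behaviour
# of skipping the base vectors (the dataset.prepare(True) quoted above).
ds.prepare(skip_data=not args.download)
```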

@gony-noreply (Contributor, Author)

@maumueller Sorry for the late reply.

> it seems to me that it returns the actual file when you are working with a cropped version.

If both the 1B-sized file and the crop file exist, the 1B file path is always returned. See the code below.

```python
fn = os.path.join(self.basedir, self.ds_fn)
if os.path.exists(fn):   # the 1B file wins whenever it exists
    return fn
if self.nb != 10**9:     # the crop-file name is only built past this check
    ...
```

ds_fn holds the 1B-sized file name, and the crop-file name is constructed only when the 1B file does not exist.

I hit this in my development environment (not the actual competition evaluation environment), where both the 1B dataset and the small dataset were present.
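
For illustration, the ordering I expected would be something like the following (a sketch only; the crop-file suffix is my guess at the repo's naming, so match it to whatever datasets.py actually uses):

```python
import os

def get_dataset_fn(self):
    fn = os.path.join(self.basedir, self.ds_fn)
    if self.nb != 10**9:
        # Prefer the cropped file so a local 1B file cannot shadow it.
        crop_fn = fn + '.crop_nb_%d' % self.nb  # illustrative suffix
        if os.path.exists(crop_fn):
            return crop_fn
    if os.path.exists(fn):
        return fn
    raise RuntimeError('file %s not found' % fn)
```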
