Questions about Dataset class Interface and dataset prepare #32
The original second question was about the build phase of the benchmark process: whether I should download the data myself for the build (in the benchmark code, skipdata is set to True). But someone might be wondering whether it is possible to use the dataset vectors at search time.
In evaluation, the dataset is not available. For T2, the index can store a copy of the data (or a compressed version) as part of the 1 TB limit on index size.
When I look at https://github.com/harsha-simhadri/big-ann-benchmarks/blob/main/benchmark/datasets.py#L268-L276, this should be safe to use. Maybe I'm misunderstanding your question, @gony0? In general, you will have to take care of downloading the base vectors explicitly.
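For that explicit download step, here is a minimal, self-contained sketch of the pattern; the function name `prepare` and the `skip_data` flag are illustrative stand-ins mirroring the skipdata behavior mentioned above, not the repository's exact API:

```python
import os
import urllib.request

def prepare(url, dest, skip_data=False):
    """Download the dataset file to dest unless it already exists.

    Illustrative stand-in for the prepare step in benchmark/datasets.py;
    the skip_data flag mirrors the skipdata setting mentioned above.
    """
    if skip_data or os.path.exists(dest):
        return dest  # nothing to do: data skipped or already present
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    urllib.request.urlretrieve(url, dest)  # fetch the base vectors
    return dest
```

During build the benchmark runs this step with the data download skipped; an algorithm that needs the raw file would call it without skipping.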
@maumueller
If there are both a 1B-size file and a crop file, the 1B-size file path is always returned.

big-ann-benchmarks/benchmark/datasets.py Lines 283 to 286 in 59eab9f

ds_fn holds the 1B-size file name; a crop file name is created only when there is no 1B-size file.

I was talking about a case where this code didn't work well in my development environment (not the actual competition evaluation environment): both the 1B dataset and a small dataset existed.
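The behavior described above can be sketched as follows. This is a hypothetical, simplified version of the selection logic (the file-name conventions and the full_nb parameter are assumptions, not the repository's code), showing how preferring an existing crop file for a cropped dataset would avoid the problem when both files exist:

```python
import os

def dataset_fn(basedir, nb, full_nb=10**9):
    """Pick the data file for a dataset cropped to nb points.

    Hypothetical sketch: the reported behavior returns the full 1B file
    whenever it exists; preferring the crop file when nb < full_nb keeps
    get_dataset_fn-style lookups consistent when both files are present.
    """
    full = os.path.join(basedir, "base.1B.u8bin")
    crop = os.path.join(basedir, "base.1B.u8bin.crop_nb_%d" % nb)
    if nb < full_nb and os.path.exists(crop):
        return crop  # cropped dataset: prefer the actual small file
    return full      # full dataset, or no crop file available
```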
Question about dataset class interface

I wonder if it is okay to use all methods (interfaces) exposed in the Dataset class when implementing the algorithms to be used in the benchmark. I am trying to access the file directly by using the get_dataset_fn method instead of the get_dataset_iterator method, and I wonder if this is an issue.

If it is allowed, there seems to be something wrong with the implementation of the get_dataset_fn method for small datasets. In the get_dataset_fn method, if the original (1-billion) file exists, the path of the original file is returned. When used in the get_dataset_iterator method, this seems reasonable, because only part of the original file is used via mmap. However, if get_dataset_fn is an externally exposed interface, it would be appropriate for it to return the path of the actual small file. Or, when using the get_dataset_fn method on a small dataset that is not a crop file, I wonder whether I should read only part of the file.

Question about dataset prepare
big-ann-benchmarks/benchmark/main.py
Line 145 in 8180e0e
I wonder if it can be assumed that the dataset file is already downloaded in the actual evaluation.
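On the first question's point about reading only part of the original file via mmap, the idea can be sketched with a memmap-based reader. This assumes the bigann-style binary layout (an 8-byte header of two uint32 values, point count and dimension, followed by uint8 vector data); treat the exact format and the function name as assumptions:

```python
import numpy as np

def read_first_vectors(fn, nb):
    """Read only the first nb vectors of a .u8bin-style file.

    Assumed layout: uint32 count, uint32 dimension, then uint8 data.
    np.memmap keeps the large file on disk and touches only the slice read.
    """
    n, d = (int(x) for x in np.fromfile(fn, dtype=np.uint32, count=2))
    mm = np.memmap(fn, dtype=np.uint8, mode="r", offset=8, shape=(n, d))
    return np.array(mm[:min(nb, n)])  # copy just the requested slice
```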