Add DeeplakeDriver and Docs #68
Conversation
it = IterableSource(ds) if subset is None else IterableSource(ds[subset])
return it.map(lambda x: x.tensors)
Please especially take a look at this. I used the HubDriver as a template here, but from quickly browsing the docs I couldn't find any case where Hub or Deeplake datasets have subsets. I.e., I think subsets are handled directly inside the URL as a suffix, e.g. -train.
Maybe it's safe to remove this subset access in both the hub and deeplake drivers, wdyt?
Never mind, hub and deeplake "groups" seem to be similar to what we think of as a subset, and this API is the same across hub and deeplake.
Thanks for checking!
Yes, I think being able to access subsets / groups is still used, e.g. for wikitext here :)
Also, we should discuss whether to remove the HubDriver, since it looks like Deeplake is the successor of Hub. On the other hand, users might still have Hub datasets that they want to use, so keeping the Hub driver around a little longer probably doesn't hurt.
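A minimal sketch of the group/subset access discussed above, assuming deeplake's `load` and string-based group indexing; the function name and the example URL are illustrative, not the exact driver API from this PR:

```python
from typing import Optional

import deeplake
from squirrel.iterstream import IterableSource


def iter_deeplake(url: str, subset: Optional[str] = None):
    """Stream samples from a deeplake dataset, optionally restricted to one group."""
    ds = deeplake.load(url)  # e.g. "hub://activeloop/wiki-text" (illustrative URL)
    # deeplake "groups" play the role of subsets here and can be indexed by name
    it = IterableSource(ds) if subset is None else IterableSource(ds[subset])
    return it.map(lambda x: x.tensors)
```

Indexing `ds[subset]` mirrors how the HubDriver handles groups, so keeping the branch means both drivers behave the same way.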
docs/driver_integration.rst
Outdated
The examples below show how to instantiate the three drivers and what they output. Note that we simply "forward" the output of these libraries, so the format of whatever they output may differ. For example, in the code below we take the first item of the pipeline with :code:`.take(1)` and map a :code:`print` function over the pipeline, which outputs something different for each backend. The images coming from the Huggingface servers are :py:class:`PIL` images, while for Hub they are in its custom :py:class:`Tensor` format. The user should write corresponding pre-processing functions that suit their use case.
To use the drivers, you need to install :py:class:`squirrel-datasets-core` with the corresponding dependency. Note that the Huggingface dependency already comes pre-installed with :py:class:`squirrel-datasets-core`, because it's considered a core component of it.
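For illustration, a pipeline along the lines described in the quoted docs could look roughly like this; the driver class, import path, and dataset name are assumptions rather than quotes from the docs:

```python
from squirrel_datasets_core.driver.huggingface import HuggingfaceDriver

# take the first item of the pipeline and print it; what gets printed depends
# on the backend (e.g. PIL images when the data comes from Huggingface)
HuggingfaceDriver("cifar100").get_iter("train").take(1).map(print).join()
```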
Is the "note" correct?
Yes, the hub dependency is in here. We should change this: either remove the extra req.hub.in, which then has no point, or alternatively remove the hub dependency from req.in. Wdyt? :)
Ahh, you're right, there is even a hub dependency in there. But I was actually referring to the datasets dependency (Huggingface). So should we remove both of those dependencies from req.in and add another requirements.datasets.in then?
Yes, that would work. One other solution could be to remove the hub dependency from requirements.in and have a separate requirements.huggingface.in with the hub, deeplake, and datasets dependencies in one file. Wdyt?
Shouldn't each dependency [datasets, deeplake, hub, torchvision] simply be in its own requirements.X.in file? Not all in one file, I would suggest.
Yes, that's also fine. It's more of a design decision, I think: if we believe there are realistic use cases for someone to use any of those components in isolation, i.e. someone using the hub driver but not needing access to datasets, then it probably makes sense to separate them. Feel free to choose whichever way you think is most appropriate :)
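To make the trade-off concrete, here is a rough sketch of how per-dependency requirement files could map onto pip extras in setup.py; the file names and the helper function are hypothetical, not the repo's actual packaging code:

```python
from pathlib import Path
from typing import List


def read_requirements(fname: str) -> List[str]:
    """Read one requirement per line, skipping comments and blank lines."""
    lines = Path(fname).read_text().splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.startswith("#")]


# one optional extra per driver, each backed by its own requirements.X.in file
extras_require = {
    "huggingface": read_requirements("requirements.huggingface.in"),
    "hub": read_requirements("requirements.hub.in"),
    "deeplake": read_requirements("requirements.deeplake.in"),
    "torchvision": read_requirements("requirements.torchvision.in"),
}
```

With a layout like this, someone who only needs the hub driver could install just that extra without pulling in the other backends.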
Tests ran through successfully @winfried-ripken
Thanks a lot @axkoenig! This is a great addition to squirrel-datasets-core :)
@@ -32,9 +32,9 @@ For using the torchvision driver call:
pip install "squirrel-core[torch]"
pip install "squirrel-datasets-core[torchvision]"
```
For using the hub driver call:
Shouldn't we keep both options to install hub and deeplake? Or is deeplake always preferred (then we should mention this)?
Deeplake is preferred because it's at least as fast as hub was, afaik. We keep hub for legacy users, but I would say we recommend/mention deeplake in the instructions without mentioning hub explicitly here. There are some hub docs in the readthedocs documentation.
…/squirrel-datasets-core into ak-add-deeplake-driver
I decided to put every major optional dependency (like huggingface and hub) into a separate requirements.in file. This aligns with how we do it in the …
Sounds good to me :) Thank you!
Description
Deeplake is a fast data loading framework and offers especially fast remote loading.
The screenshot below is from the Activeloop front page, where the Squirrel MessagepackDriver comes out as very fast; the results are from Figure 5 of this paper. This PR adds native Deeplake support in Squirrel, so our users get maximum speed with minimum effort. I also updated the docs and Readme as shown below.
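As a rough usage sketch (the import path, constructor signature, and dataset URL below are assumptions for illustration, not necessarily the exact API added in this PR):

```python
from squirrel_datasets_core.driver.deeplake import DeeplakeDriver

# stream samples straight from Activeloop storage; the output is whatever
# deeplake yields per sample (its tensors), forwarded without conversion
driver = DeeplakeDriver("hub://activeloop/cifar100-train")
first = driver.get_iter().take(1).collect()
print(first[0])
```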
Type of change
Checklist: