
Add DeeplakeDriver and Docs #68

Merged
axkoenig merged 7 commits into main from ak-add-deeplake-driver on Dec 13, 2022

Conversation


@axkoenig axkoenig commented Nov 4, 2022

Description

Deeplake is a fast data-loading framework that offers especially fast remote loading.
The screenshot below is from the Activeloop front page, where the Squirrel MessagepackDriver comes out as very fast; the results are from Figure 5 of this paper. This PR adds native Deeplake support to Squirrel so our users get maximum speed with minimum effort.

[Screenshot: Activeloop data-loader benchmark, 2022-11-04]

I also updated the docs and README as shown below.
[Screenshot: updated docs and README]
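For reviewers, roughly how the new driver is meant to be used. This is a minimal sketch that assumes the import path, constructor signature, and example dataset URL mirror the existing HubDriver; none of it is quoted from the PR itself.

```python
from squirrel_datasets_core.driver.deeplake import DeeplakeDriver  # assumed import path

# Example Activeloop dataset URL; any Deep Lake dataset path is expected to work the same way.
driver = DeeplakeDriver(url="hub://activeloop/cifar100-train")

# Stream samples through squirrel's iterstream API and print the first one.
driver.get_iter().take(1).map(print).collect()
```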

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring including code style reformatting
  • Other (please describe):

Checklist:

  • I have read the contributing guideline doc (external contributors only)
  • Lint and unit tests pass locally with my changes
  • I have kept the PR small so that it can be easily reviewed
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • All dependency changes have been reflected in the pip requirement files.

Comment on lines +35 to +36
it = IterableSource(ds) if subset is None else IterableSource(ds[subset])
return it.map(lambda x: x.tensors)
axkoenig (Contributor Author):

Please take an especially close look at this. I used the HubDriver as a template here, but from quickly browsing the docs I couldn't find any case where Hub or Deeplake datasets have subsets, i.e. I think the subsets are handled directly inside the URL as a suffix like -train.

axkoenig (Contributor Author):

Maybe it's safe to remove this subset access in both the hub and deeplake drivers, wdyt?

axkoenig (Contributor Author):

Never mind - hub and deeplake "groups" seem to be similar to what we think of as a subset, and this API is the same across hub and deeplake.

winfried-ripken (Contributor):

Thanks for checking!
Yes, I think accessing subsets / groups is still used, e.g. for wikitext here :)
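For context, a minimal sketch of the group access being discussed; the dataset URL and group name are hypothetical, and the exact deeplake group semantics should be checked against its docs.

```python
import deeplake

# Hypothetical dataset URL; assume it defines groups such as "train" and "test".
ds = deeplake.load("hub://some-org/wikitext-like-dataset")

# Indexing with a group name returns a view restricted to that group, which is
# what the driver's `subset` argument maps onto via ds[subset] above.
train_view = ds["train"]
print(train_view.tensors.keys())  # tensors available inside the group
```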

@axkoenig axkoenig commented Nov 4, 2022

We should also discuss whether to remove the HubDriver, since it looks like Deeplake is the successor of Hub. On the other hand, users might still have Hub datasets that they want to use, so keeping the Hub driver around a little longer probably doesn't hurt.


The examples below show how to instantiate the three drivers and what they output. Note that we simply “forward” the output of these libraries, so the format of their output may differ. For example, in the code below we take the first item of the pipeline with :code:`.take(1)` and map a :code:`print` function over the pipeline, which outputs something different for each backend. The images coming from the Huggingface servers are :py:class:`PIL` images, while for Hub they are in its custom :py:class:`Tensor` format. Users should write pre-processing functions that suit their use case.
To use the drivers, you need to install :py:class:`squirrel-datasets-core` with the corresponding dependency. Note that the Huggingface dependency already comes pre-installed with :py:class:`squirrel-datasets-core`, because it is considered a core component of it.
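For illustration, a minimal sketch of the pattern the paragraph above describes; the import paths, dataset identifiers, and the split argument are assumptions for this sketch and are not part of the diff.

```python
from squirrel_datasets_core.driver.hub import HubDriver  # assumed import paths
from squirrel_datasets_core.driver.huggingface import HuggingfaceDriver

# Huggingface backend: the forwarded items typically contain PIL images.
HuggingfaceDriver("cifar100").get_iter("train").take(1).map(print).collect()

# Hub/Deeplake backend: same pattern, but items arrive in the library's own Tensor format.
HubDriver("hub://activeloop/cifar100-train").get_iter().take(1).map(print).collect()
```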
axkoenig (Contributor Author):

Is the "note" correct?

winfried-ripken (Contributor):

Yes, the hub dependency is in here - we should change this and either remove the extra req.hub.in, which has no point, or alternatively remove the hub dependency from req.in. Wdyt? :)

@axkoenig axkoenig (Contributor Author) Nov 17, 2022:

Ahh, you're right - there is even a hub dependency in there. But I was actually referring to the datasets dependency (Huggingface). So should we remove both of those dependencies from req.in and add another requirements.datasets.in then?

winfried-ripken (Contributor):

Yes, that would work. One other solution could be to remove the hub dependency from requirements.in and have a separate requirements.huggingface.in with the hub, deeplake, and datasets dependencies in one file. Wdyt?

axkoenig (Contributor Author):

Shouldn't each dependency [datasets, deeplake, hub, torchvision] simply be in its own requirements.X.in file? I would suggest not putting them all in one file.

winfried-ripken (Contributor):

Yes, that's also fine. I think it's more of a design decision: if we believe there are realistic use cases for someone using any of those components in isolation, e.g. someone using the hub driver but not needing access to datasets, then it probably makes sense to separate them. Feel free to choose whichever way you think is most appropriate :)

@axkoenig axkoenig commented Nov 7, 2022

Tests ran through successfully, @winfried-ripken.

@winfried-ripken winfried-ripken (Contributor) left a comment

Thanks a lot @axkoenig! This is a great addition to squirrel-datasets-core :)

@@ -32,9 +32,9 @@ For using the torchvision driver call:
pip install "squirrel-core[torch]"
pip install "squirrel-datasets-core[torchvision]"
```
For using the hub driver call:
winfried-ripken (Contributor):

Shouldn't we keep both options to install hub and deeplake? Or is deeplake always preferred (in which case we should mention this)?

axkoenig (Contributor Author):

Deeplake is preferred because it's at least as fast as hub was, afaik. We keep hub for legacy users, but I would recommend/mention deeplake in the instructions without mentioning hub explicitly here. There are some hub docs in the readthedocs documentation.


@axkoenig axkoenig commented:

I decided to put every major optional dependency (like huggingface and hub) into its own requirements.X.in file. This aligns with how we do it in the squirrel-core parent package. Closing this now :)
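For reference, a hypothetical sketch of how per-dependency requirement files typically feed pip extras; this is not the actual setup.py of this repo, and the file names and extras listed are assumptions.

```python
from pathlib import Path

from setuptools import find_packages, setup


def read_reqs(path: str) -> list:
    """Return the non-empty, non-comment lines of a requirements file."""
    lines = Path(path).read_text().splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.startswith("#")]


setup(
    name="squirrel-datasets-core",
    packages=find_packages(),
    install_requires=read_reqs("requirements.in"),
    extras_require={
        # one extra per optional driver dependency, e.g. `pip install "squirrel-datasets-core[deeplake]"`
        extra: read_reqs(f"requirements.{extra}.in")
        for extra in ("huggingface", "hub", "deeplake", "torchvision")
    },
)
```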

@axkoenig axkoenig merged commit 7dd7337 into main Dec 13, 2022
@axkoenig axkoenig deleted the ak-add-deeplake-driver branch December 13, 2022 16:24
@github-actions github-actions bot locked and limited conversation to collaborators Dec 13, 2022
@winfried-ripken winfried-ripken (Contributor) commented:

> I decided to put every major optional dependency (like huggingface and hub) into its own requirements.X.in file. This aligns with how we do it in the squirrel-core parent package. Closing this now :)

Sounds good to me :) Thank you!
