
Add DeeplakeDriver and Docs #68

Merged
axkoenig merged 7 commits into main from ak-add-deeplake-driver on Dec 13, 2022

Conversation


@axkoenig axkoenig commented Nov 4, 2022

Description

Deeplake is a fast data-loading framework that offers especially fast remote loading.
The screenshot below is from the Activeloop front page, where the Squirrel MessagepackDriver comes out as very fast; the results are from Figure 5 of this paper. This PR adds native Deeplake support to Squirrel so our users get maximum speed with minimum effort.

[Screenshot: Activeloop data-loader benchmark, 2022-11-04]

I also updated the docs and README as shown below.
[Screenshot: updated docs and README]
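For reviewers, roughly how the new driver is meant to be used. This is a minimal sketch that assumes the import path, constructor signature, and example dataset URL mirror the existing HubDriver; none of it is quoted from the PR itself.

```python
from squirrel_datasets_core.driver.deeplake import DeeplakeDriver  # assumed import path

# Example Activeloop dataset URL; any Deep Lake dataset path is expected to work the same way.
driver = DeeplakeDriver(url="hub://activeloop/cifar100-train")

# Stream samples through squirrel's iterstream API and print the first one.
driver.get_iter().take(1).map(print).collect()
```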

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring including code style reformatting
  • Other (please describe):

Checklist:

  • I have read the contributing guideline doc (external contributors only)
  • Lint and unit tests pass locally with my changes
  • I have kept the PR small so that it can be easily reviewed
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • All dependency changes have been reflected in the pip requirement files.

Comment on lines +35 to +36
it = IterableSource(ds) if subset is None else IterableSource(ds[subset])
return it.map(lambda x: x.tensors)
axkoenig (Contributor Author):

Please take an especially close look at this. I used the HubDriver as a template here, but from quickly browsing the docs I couldn't find any case where Hub or Deeplake datasets have subsets, i.e. I think the subsets are handled directly inside the URL as a suffix like -train.

axkoenig (Contributor Author):

Maybe it's safe to remove this subset access in both the hub and deeplake drivers, wdyt?

axkoenig (Contributor Author):

Never mind - hub and deeplake "groups" seem to be similar to what we think of as a subset, and this API is the same across hub and deeplake.

winfried-ripken (Contributor):

Thanks for checking!
Yes, I think accessing subsets / groups is still used, e.g. for wikitext here :)
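For context, a minimal sketch of the group access being discussed; the dataset URL and group name are hypothetical, and the exact deeplake group semantics should be checked against its docs.

```python
import deeplake

# Hypothetical dataset URL; assume it defines groups such as "train" and "test".
ds = deeplake.load("hub://some-org/wikitext-like-dataset")

# Indexing with a group name returns a view restricted to that group, which is
# what the driver's `subset` argument maps onto via ds[subset] above.
train_view = ds["train"]
print(train_view.tensors.keys())  # tensors available inside the group
```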

@axkoenig axkoenig commented Nov 4, 2022

We should also discuss whether to remove the HubDriver, since it looks like Deeplake is the successor of Hub. On the other hand, users might still have Hub datasets that they want to use, so keeping the Hub driver around a little longer probably doesn't hurt.


The examples below show how to instantiate the three drivers and what they output. Note that we simply “forward” the output of these libraries, so the format of their output may differ. For example, in the code below we take the first item of the pipeline with :code:`.take(1)` and map a :code:`print` function over the pipeline, which outputs something different for each backend. The images coming from the Huggingface servers are :py:class:`PIL` images, while for Hub they are in its custom :py:class:`Tensor` format. Users should write pre-processing functions that suit their use case.
To use the drivers, you need to install :py:class:`squirrel-datasets-core` with the corresponding dependency. Note that the Huggingface dependency already comes pre-installed with :py:class:`squirrel-datasets-core`, because it is considered a core component of it.
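For illustration, a minimal sketch of the pattern the paragraph above describes; the import paths, dataset identifiers, and the split argument are assumptions for this sketch and are not part of the diff.

```python
from squirrel_datasets_core.driver.hub import HubDriver  # assumed import paths
from squirrel_datasets_core.driver.huggingface import HuggingfaceDriver

# Huggingface backend: the forwarded items typically contain PIL images.
HuggingfaceDriver("cifar100").get_iter("train").take(1).map(print).collect()

# Hub/Deeplake backend: same pattern, but items arrive in the library's own Tensor format.
HubDriver("hub://activeloop/cifar100-train").get_iter().take(1).map(print).collect()
```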
axkoenig (Contributor Author):

Is the "note" correct?

winfried-ripken (Contributor):

Yes, the hub dependency is in here - we should change this and either remove the extra req.hub.in, which has no point, or alternatively remove the hub dependency from req.in. Wdyt? :)

@axkoenig axkoenig (Contributor Author) Nov 17, 2022:

Ahh, you're right - there is even a hub dependency in there. But I was actually referring to the datasets dependency (Huggingface). So should we remove both of those dependencies from req.in and add another requirements.datasets.in then?

winfried-ripken (Contributor):

Yes, that would work. One other solution could be to remove the hub dependency from requirements.in and have a separate requirements.huggingface.in with the hub, deeplake, and datasets dependencies in one file. Wdyt?

axkoenig (Contributor Author):

Shouldn't each dependency [datasets, deeplake, hub, torchvision] simply be in its own requirements.X.in file? I would suggest not putting them all in one file.

winfried-ripken (Contributor):

Yes, that's also fine. I think it's more of a design decision: if we believe there are realistic use cases for someone using any of those components in isolation, e.g. someone using the hub driver but not needing access to datasets, then it probably makes sense to separate them. Feel free to choose whichever way you think is most appropriate :)

@axkoenig axkoenig commented Nov 7, 2022

Tests ran through successfully, @winfried-ripken.

@winfried-ripken winfried-ripken (Contributor) left a comment

Thanks a lot @axkoenig! This is a great addition to squirrel-datasets-core :)

@@ -32,9 +32,9 @@ For using the torchvision driver call:
pip install "squirrel-core[torch]"
pip install "squirrel-datasets-core[torchvision]"
```
For using the hub driver call:
winfried-ripken (Contributor):

Shouldn't we keep both options to install hub and deeplake? Or is deeplake always preferred (in which case we should mention this)?

axkoenig (Contributor Author):

Deeplake is preferred because it's at least as fast as hub was, afaik. We keep hub for legacy users, but I would recommend/mention deeplake in the instructions without mentioning hub explicitly here. There are some hub docs in the readthedocs documentation.


@axkoenig axkoenig commented:

I decided to put every major optional dependency (like huggingface and hub) into its own requirements.X.in file. This aligns with how we do it in the squirrel-core parent package. Closing this now :)
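For reference, a hypothetical sketch of how per-dependency requirement files typically feed pip extras; this is not the actual setup.py of this repo, and the file names and extras listed are assumptions.

```python
from pathlib import Path

from setuptools import find_packages, setup


def read_reqs(path: str) -> list:
    """Return the non-empty, non-comment lines of a requirements file."""
    lines = Path(path).read_text().splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.startswith("#")]


setup(
    name="squirrel-datasets-core",
    packages=find_packages(),
    install_requires=read_reqs("requirements.in"),
    extras_require={
        # one extra per optional driver dependency, e.g. `pip install "squirrel-datasets-core[deeplake]"`
        extra: read_reqs(f"requirements.{extra}.in")
        for extra in ("huggingface", "hub", "deeplake", "torchvision")
    },
)
```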

@axkoenig axkoenig merged commit 7dd7337 into main Dec 13, 2022
@axkoenig axkoenig deleted the ak-add-deeplake-driver branch December 13, 2022 16:24
@github-actions github-actions bot locked and limited conversation to collaborators Dec 13, 2022
@winfried-ripken winfried-ripken (Contributor) commented:

> I decided to put every major optional dependency (like huggingface and hub) into its own requirements.X.in file. This aligns with how we do it in the squirrel-core parent package. Closing this now :)

Sounds good to me :) Thank you!
