Datasets

A comprehensive collection of parsers for different datasets.

Requirements

You'll need Python 3.9 or above. We highly recommend using virtualenv:

virtualenv .env --python=3.9
source ./.env/bin/activate
pip install -r requirements.txt

Writing a Parser

We provide several utilities to write a shareable parser. A custom parser is a subclass of parsers.Parser; an example is provided at parsers.cifar10. Before contributing, please read the contributing guide.

You must implement the following two methods:

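import os
from pathlib import Path
from typing import Any

from parsers import Parser
from parsers.datatypes import ImageAnnotationFile


class MyParser(Parser):

    def parse_annotation(self, *args: Any, **kwargs: Any) -> ImageAnnotationFile:
        # your logic here
        raise NotImplementedError

    def parse(self, root: Path):
        # your logic here
        raise NotImplementedError


my_parser = MyParser(
    images_dir=Path("./images"),
    annotation_dir=Path("./annotations"),
    dataset_name="foo",
    path="/train",
)
# parse it
my_parser.parse(root=Path("./foo"))
# upload the images
my_parser.upload(os.environ["DARWIN_API_KEY"])
# or upload only a few samples first to check the parser
my_parser.upload_sample(os.environ["DARWIN_API_KEY"], n_samples=5)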

Parser comes with special methods to upload the images. Because imports are currently slow on our side, you can check the correctness of your parser by uploading only a few samples with .upload_sample.

For each dataset, you are expected to submit a PR that will be reviewed by us :)
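One step a parse implementation is likely to need is matching each image to its annotation file. The following is a minimal sketch of that step, assuming images and annotations share the same file stem and that annotations are stored as .json files. The helper name pair_images_with_annotations is hypothetical and is not part of the parsers package.

```python
from pathlib import Path
from typing import Iterator, Tuple


def pair_images_with_annotations(
    images_dir: Path,
    annotation_dir: Path,
    annotation_suffix: str = ".json",  # assumption: one annotation file per image
) -> Iterator[Tuple[Path, Path]]:
    """Yield (image, annotation) pairs that share the same file stem.

    Hypothetical helper for illustration; not part of the parsers package.
    """
    # Index annotation files by stem so each lookup is a dictionary access.
    annotations = {p.stem: p for p in annotation_dir.glob(f"*{annotation_suffix}")}
    for image_path in sorted(images_dir.iterdir()):
        if not image_path.is_file():
            continue
        annotation_path = annotations.get(image_path.stem)
        # Images without a matching annotation are skipped silently.
        if annotation_path is not None:
            yield image_path, annotation_path
```

Inside parse, each yielded pair could then be handed to parse_annotation. Whether unmatched images should be skipped or reported depends on the dataset.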

Data types

Each parser has to return a darwin-json. To make things easier, we provide custom types in datatypes.py. You can create an AnnotationFile using the pre-defined data classes in there. Below we showcase how to create a simple annotation file with one bounding box and one tag:

from parsers.datatypes import Annotation, BoundingBox, Image, ImageAnnotationFile, Tag

ann = ImageAnnotationFile(
    dataset="foo",
    image=Image(
        width=100,
        height=100,
        original_filename="hey",
        filename="hey"),
    annotations=[
        Annotation(name="a")
        .add_data(BoundingBox(x=1, y=2, h=10, w=10))
        .add_data(Tag())
    ],
)

Annotations can be easily converted to JSON using the dataclasses.asdict utility:

from dataclasses import asdict
from pprint import pprint

pprint(asdict(ann))

{'annotations': [{'bounding_box': {'h': 10, 'w': 10, 'x': 1, 'y': 2},
                            'tag': {}},
                  'name': 'a'}],
 'dataset': 'foo',
 'image': {'filename': 'hey',
           'height': 100,
           'original_filename': 'hey',
           'path': None,
           'seq': None,
           'thumbnail_url': None,
           'url': None,
           'width': 100,
           'workview_url': None}}
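Since asdict returns plain dictionaries, the result can be passed straight to the standard json module. The sketch below shows this with simplified stand-in dataclasses so that it runs on its own; they are not the real definitions from datatypes.py, and the to_json helper is hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Image:
    # Simplified stand-in for the Image type; the real one has more fields.
    width: int
    height: int
    original_filename: str
    filename: str
    path: Optional[str] = None


@dataclass
class ImageAnnotationFile:
    # Simplified stand-in; annotations are plain dicts here for brevity.
    dataset: str
    image: Image
    annotations: List[Dict[str, Any]] = field(default_factory=list)


def to_json(annotation_file: Any) -> str:
    """Serialize a dataclass instance to a JSON string (hypothetical helper)."""
    return json.dumps(asdict(annotation_file), indent=2, sort_keys=True)


ann = ImageAnnotationFile(
    dataset="foo",
    image=Image(width=100, height=100, original_filename="hey", filename="hey"),
    annotations=[{"name": "a", "tag": {}}],
)
print(to_json(ann))
```

With the real data classes, the same json.dumps(asdict(ann)) call should apply, as long as every field holds JSON-serializable values.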

To correctly specify a dataset split, e.g. 'train', pass the path parameter to the Image type:

from parsers.datatypes import Annotation, BoundingBox, Image, ImageAnnotationFile, Tag

ann = ImageAnnotationFile(
    dataset="foo",
    image=Image(
        width=100,
        height=100,
        original_filename="hey",
        filename="hey",
        path="/train"), # <------ HERE!
    annotations=[
        Annotation(name="a")
        .add_data(BoundingBox(x=1, y=2, h=10, w=10))
        .add_data(Tag())
    ],
)
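When a dataset stores each split in its own sub-directory, the path value can be derived from the image location instead of being hard-coded. This is a minimal sketch under that assumption; the split_path helper is hypothetical and not part of the parsers package.

```python
from pathlib import PurePosixPath


def split_path(image_file: str, images_dir: str) -> str:
    """Return the dataset path (e.g. "/train") for an image under images_dir.

    Hypothetical helper: assumes the split is encoded as the sub-directory
    between images_dir and the image file.
    """
    relative = PurePosixPath(image_file).relative_to(images_dir)
    parent = relative.parent
    # Images stored directly in images_dir map to the root path.
    return "/" if str(parent) == "." else f"/{parent}"


print(split_path("images/train/cat.png", "images"))
print(split_path("images/cat.png", "images"))
```

The first call yields "/train" and the second "/", which could then be passed as the path argument of Image.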
