Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: File format loaders #110

Open
calebho opened this issue Dec 19, 2019 · 20 comments
Open

RFC: File format loaders #110

calebho opened this issue Dec 19, 2019 · 20 comments

Comments

@calebho
Copy link
Collaborator

calebho commented Dec 19, 2019

cc: @keggsmurph21

Sharing some thoughts on how ultratrace2.model.files could be implemented/refactored...

The current approach is to have a class for each kind of file (alignment, image, and sound) containing loader implementations for the accepted file formats. In order to support a new file format for a given file kind, one needs to add a class definition to the respective class. For example, if I want to add support for AAC, I would need to go into sound.py and add a class definition under Sound.

I propose that we do something a bit different which allows you to physically separate file format implementations from the kind of file they're loading while keeping them logically related using inheritance. We would then maintain a registry which maps MIME type/extension to a loader implementation. The hierarchy would look something like:

  • FileLoaderBase[TFileChunk]
    • AlignmentFileLoader
      • TextGridFileLoader
      • ...
    • ImageFileLoader
      • DICOMFileLoader
      • ...
    • SoundFileLoader
      • WAVFileLoader
      • FLACFileLoader
      • ...

The base class is generic in TFileChunk because I suppose the end goal is to have an index operation on FileBundle which return the file chunk at that index for each kind of file. This index operation returns a TFileChunk which would be something like AlignmentChunk for alignment files, ImageChunk for image files, and SoundChunk for sound files.

We would then have registry.py which contains

  • a module-level dictionary mapping an extension to a class which knows how to load that file format and
  • an API to interact with this dictionary.

This API might look something like

def register_loader_for_extension(
    extension: str, loader_cls: Type[FileLoaderBase]
) -> None:
    """Register a loader which recognizes a file in the format indicated by the given 
    extension.

    Parameters:
        extension: A file extension, e.g. ".wav", indicating the file format
        loader_cls: A file loader which knows how to load files in the given file format
    """
    ...

def register_loader_for_mime_type(
    mime_type: str, loader_cls: Type[FileLoaderBase]
) -> None:
    """Analogous to `register_loader_for_extension`, but for MIME types."""
    ...

def get_alignment_loader_for_extension(extension: str) -> AlignmentFileLoader:
    ...

def get_sound_loader_for_extension(extension: str) -> SoundFileLoader:
    ...

def get_image_loader_for_extension(extension: str) -> ImageFileLoader:
    ...

# ... Analogous `get_(alignment|sound|image)_loader_for_mime_type` ...

The final piece is dynamically instantiating the correct file loader given a file path. One would parse the file path for the extension, then use the get_ functions to retrieve the appropriate loader or throw an exception if the file format is not supported. Given the appropriate loader, an instance is constructed using the file path.

Implementing a new file format might look something like

# aac.py
from .registry import register_loader_for_extension, register_loader_for_mime_type
from .base import SoundFileLoader

class AACFileLoader(SoundFileLoader):
    def load(self) -> None:
        ...

    def get_chunk(self, idx: int) -> Optional[SoundChunk]:
        ...

register_loader_for_extension(".aac", AACFileLoader)
register_loader_for_mime_type("audio/aac", AACFileLoader)

This approach prevents 1k+ line files as the number of file format implementation grows which the current approach cannot avoid. We can now have one file format implementation per file which is much easier to grok IMO. It also opens up the possibility of exposing the registry API as a public API in which users who don't have control over this library can register their own file loader plugins.

@keggsmurph21
Copy link
Collaborator

This sounds great. One small thing is that I think we should have

def get_loader_for_extension(path: str) -> FileLoaderBase: ...
def get_loader_for_mime_type(path: str) -> FileLoaderBase: ...

instead of specific ones for each loader subcategory, since the consumer will just be iterating through a tree of files, and each particular file should only be loaded by one Loader.

@keggsmurph21
Copy link
Collaborator

or even

def get_loader_for(path: str) -> FileLoaderBase: ...

@keggsmurph21
Copy link
Collaborator

OK, I'm going to start implementing this and see what falls out. Thanks for the g00d ideaz

@keggsmurph21
Copy link
Collaborator

@calebho do you think it makes more sense to return the type or the constructed object? i.e.

def get_loader_for(path: str) -> Type[FileLoaderBase]: ...
# OR
def get_loader_for(path: str) -> FileLoaderBase: ...

@keggsmurph21
Copy link
Collaborator

import mimetypes
import os

from collections import defaultdict
from typing import DefaultDict, Sequence, Set, Type

from .base import FileLoaderBase


# global maps
extension_to_loaders_map: DefaultDict[str, Set[Type[FileLoaderBase]]] = defaultdict(set)
mime_type_to_loaders_map: DefaultDict[str, Set[Type[FileLoaderBase]]] = defaultdict(set)


def register_loader_for_extensions_and_mime_types(
    extensions: Sequence[str],
    mime_types: Sequence[str],
    loader_cls: Type[FileLoaderBase],
) -> None:
    """Register a loader which recognizes a file in the format indicated by the given 
    extensions and MIME types.

    Parameters:
        extensions: A set of file extension, e.g. [".wav"], indicating the file format
        mime_types: A set of MIME types, e.g. ["audio/x-wav", "audio/wav"], also indicating the file format
        loader_cls: A file loader which knows how to load files with the given file extensions and MIME types
    """
    pass


def get_loader_for(path: str) -> Sequence[Type[FileLoaderBase]]:

    global extension_to_loaders_map
    global mime_type_to_loaders_map

    _, extension = os.path.splitext(path.lower())
    mime_type, _ = mimetypes.guess_type(path)

    # NB: Even if `mime_type` is `None`, the `defaultdict` will give us
    #     an empty set, so this won't break.
    loader_clses_by_extension = extension_to_loaders_map[extension]
    loader_clses_by_mime_type = mime_type_to_loaders_map[mime_type]

    # NB: Use set-intersection (could potentially use set-union instead).
    loader_clses = loader_clses_by_extension & loader_clses_by_mime_type

    return loader_clses

We may still need to do resolution here, since loader_clses could potentially have len >1

@calebho
Copy link
Collaborator Author

calebho commented Dec 19, 2019

do you think it makes more sense to return the type or the constructed object?

Yeah I meant to return Type[...] originally.

We may still need to do resolution here, since loader_clses could potentially have len >1

What's the use case for having multiple loaders for a single extension?

@keggsmurph21
Copy link
Collaborator

We may still need to do resolution here, since loader_clses could potentially have len >1

What's the use case for having multiple loaders for a single extension?

What if someone wants to add their own custom WAVLoader? I'm not sure if we should support that, but it's worth consideration.

@keggsmurph21
Copy link
Collaborator

Also, with cf17cff it turns out we don't want/need generics. The abstract ImageSetFileLoader can just have an abstract ::get_frame(i: int) -> PIL.Image.Image method. We don't really need similar functionality from AlignmentFileLoader or SoundFileLoader.

Tests still need to be written for this stuff of course ...

@jonorthwash
Copy link
Collaborator

I think @mr-martian is actively working on file reader stuff, so it might be good to coordinate so you don't step on each others' feet.

@keggsmurph21
Copy link
Collaborator

keggsmurph21 commented Dec 19, 2019 via email

@jonorthwash
Copy link
Collaborator

Okay, you guys working in a branch?

@keggsmurph21
Copy link
Collaborator

Well, we've been committing mostly to the ultratrace2 directory. I am currently working in the manage-multiple-projects branch (#107), but am hoping to merge that in to master soon.

@mr-martian
Copy link
Contributor

mr-martian commented Dec 19, 2019

Two complications that might be relevant:

  • There are 3 distinct ways of loading .dicom files (reading from pixel data, reading pngs extracted from pixel data, and reading from the private use area)
  • Loading .ult files actually requires the loader to know about two different files: .ult and a metadata file for which I have the extension get set to US.txt

@kmurphy4
Copy link
Contributor

Do we generate that metadata file ourselves?

@mr-martian
Copy link
Contributor

No, it is generated by the recording device.

@kmurphy4
Copy link
Contributor

Ok, so we should require that file to also be present in order to decide .ult files, right?

@mr-martian
Copy link
Contributor

Yes. xyz.ult comes with xyzUS.txt which is a plain text file containing the image dimensions and a few other things.

@kmurphy4
Copy link
Contributor

Thanks! Will keep this in mind

@calebho
Copy link
Collaborator Author

calebho commented Dec 20, 2019

What if someone wants to add their own custom WAVLoader? I'm not sure if we should support that, but it's worth consideration.

Then they would overwrite the default using the registry API. It doesn't make much sense to me to have more than one loader at once for one file format.

@keggsmurph21
Copy link
Collaborator

What if someone wants to add their own custom WAVLoader? I'm not sure if we should support that, but it's worth consideration.

Then they would overwrite the default using the registry API. It doesn't make much sense to me to have more than one loader at once for one file format.

That's probably a sensible behavior. I responded to this more fully here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants