RFC: File format loaders #110
Comments
This sounds great. One small thing is that I think we should have
    def get_loader_for_extension(path: str) -> FileLoaderBase: ...
    def get_loader_for_mime_type(path: str) -> FileLoaderBase: ...
instead of specific ones for each loader subcategory, since the consumer will just be iterating through a tree of files, and each particular file should only be loaded by one |
or even
    def get_loader_for(path: str) -> FileLoaderBase: ... |
OK, I'm going to start implementing this and see what falls out. Thanks for the g00d ideaz |
@calebho do you think it makes more sense to return the type or the constructed object? i.e.
    def get_loader_for(path: str) -> Type[FileLoaderBase]: ...
    # OR
    def get_loader_for(path: str) -> FileLoaderBase: ... |
    import mimetypes
    import os
    from collections import defaultdict
    from typing import DefaultDict, Sequence, Set, Type

    from .base import FileLoaderBase

    # global maps
    extension_to_loaders_map: DefaultDict[str, Set[Type[FileLoaderBase]]] = defaultdict(set)
    mime_type_to_loaders_map: DefaultDict[str, Set[Type[FileLoaderBase]]] = defaultdict(set)


    def register_loader_for_extensions_and_mime_types(
        extensions: Sequence[str],
        mime_types: Sequence[str],
        loader_cls: Type[FileLoaderBase],
    ) -> None:
        """Register a loader which recognizes a file in the format indicated by the given
        extensions and MIME types.

        Parameters:
            extensions: A set of file extensions, e.g. [".wav"], indicating the file format
            mime_types: A set of MIME types, e.g. ["audio/x-wav", "audio/wav"], also indicating the file format
            loader_cls: A file loader which knows how to load files with the given file extensions and MIME types
        """
        pass


    # The result is a set, so the annotation says Set rather than Sequence.
    def get_loader_for(path: str) -> Set[Type[FileLoaderBase]]:
        global extension_to_loaders_map
        global mime_type_to_loaders_map

        _, extension = os.path.splitext(path.lower())
        mime_type, _ = mimetypes.guess_type(path)

        # NB: Even if `mime_type` is `None`, the `defaultdict` will give us
        #     an empty set, so this won't break.
        loader_clses_by_extension = extension_to_loaders_map[extension]
        loader_clses_by_mime_type = mime_type_to_loaders_map[mime_type]

        # NB: Use set-intersection (could potentially use set-union instead).
        loader_clses = loader_clses_by_extension & loader_clses_by_mime_type
        return loader_clses

We may still need to do resolution here, since |
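To make the intersection-versus-union question concrete, here is a small sketch. The loader classes, the map contents, and the `resolve` helper are hypothetical stand-ins rather than the project's real registrations. It assumes a format such as TextGrid was registered by extension only, because no MIME type is known for it.

```python
from collections import defaultdict
from typing import DefaultDict, Optional, Set, Type


class FileLoaderBase:  # stand-in for the real base class
    pass


class WAVFileLoader(FileLoaderBase):  # hypothetical loader
    pass


class TextGridFileLoader(FileLoaderBase):  # hypothetical loader
    pass


by_extension: DefaultDict[str, Set[Type[FileLoaderBase]]] = defaultdict(set)
by_mime_type: DefaultDict[Optional[str], Set[Type[FileLoaderBase]]] = defaultdict(set)

# WAV is known by both extension and MIME type.
by_extension[".wav"].add(WAVFileLoader)
by_mime_type["audio/x-wav"].add(WAVFileLoader)
# Assumption: TextGrid has no MIME type, so it is registered by extension only.
by_extension[".textgrid"].add(TextGridFileLoader)


def resolve(
    extension: str, mime_type: Optional[str], strict: bool
) -> Set[Type[FileLoaderBase]]:
    """Combine both lookups with intersection (strict) or union (lenient)."""
    # .get() avoids inserting empty entries into the defaultdicts on a miss.
    ext_matches = by_extension.get(extension, set())
    mime_matches = by_mime_type.get(mime_type, set())
    return ext_matches & mime_matches if strict else ext_matches | mime_matches


wav_strict = resolve(".wav", "audio/x-wav", strict=True)
textgrid_strict = resolve(".textgrid", None, strict=True)
textgrid_lenient = resolve(".textgrid", None, strict=False)
```

Under these assumptions, intersection finds the WAV loader but returns an empty set for the TextGrid file, while union still finds `TextGridFileLoader`. Which behavior is preferable depends on how often the MIME guess is missing for the formats the project cares about.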
Yeah I meant to return
What's the use case for having multiple loaders for a single extension? |
What if someone wants to add their own custom |
Also, with cf17cff it turns out we don't want/need generics. The abstract Tests still need to be written for this stuff of course ... |
I think @mr-martian is actively working on file reader stuff, so it might be good to coordinate so you don't step on each other's toes. |
I'm thinking we'll incorporate that functionality from "upstream", and just
adapt it to the new API.
…On Thu, Dec 19, 2019, 12:38 PM Jonathan Washington ***@***.***> wrote:
I think @mr-martian <https://github.com/mr-martian> is actively working
on file reader stuff, so it might be good to coordinate so you don't step
on each others' feet.
|
Okay, you guys working in a branch? |
Well, we've been committing mostly to the |
Two complications that might be relevant:
|
Do we generate that metadata file ourselves? |
No, it is generated by the recording device. |
Ok, so we should require that file to also be present in order to decide |
Yes. |
Thanks! Will keep this in mind |
Then they would overwrite the default using the registry API. It doesn't make much sense to me to have more than one loader at once for one file format. |
That's probably a sensible behavior. I responded to this more fully here. |
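A minimal sketch of that overwrite behavior, assuming a registry keyed by extension only in which the most recent registration wins. The names `register_loader`, `DefaultWAVLoader`, and `CustomWAVLoader` are hypothetical and are not taken from the ultratrace2 code.

```python
import os
from typing import Dict, Type


class FileLoaderBase:  # stand-in for the real base class
    def __init__(self, path: str) -> None:
        self.path = path


class DefaultWAVLoader(FileLoaderBase):  # hypothetical built-in loader
    pass


class CustomWAVLoader(FileLoaderBase):  # hypothetical user-supplied loader
    pass


# One loader class per extension: a plain dict, so re-registering replaces.
_loaders_by_extension: Dict[str, Type[FileLoaderBase]] = {}


def register_loader(extension: str, loader_cls: Type[FileLoaderBase]) -> None:
    """Register loader_cls for extension, replacing any earlier registration."""
    _loaders_by_extension[extension.lower()] = loader_cls


def get_loader_for(path: str) -> Type[FileLoaderBase]:
    """Return the loader class for path, or raise if the format is unsupported."""
    _, extension = os.path.splitext(path.lower())
    try:
        return _loaders_by_extension[extension]
    except KeyError:
        raise ValueError(f"unsupported file format: {extension!r}") from None


register_loader(".wav", DefaultWAVLoader)  # library default
register_loader(".wav", CustomWAVLoader)   # user override wins
chosen = get_loader_for("session/take1.WAV")
```

With a single class per key there is nothing left to resolve at lookup time, at the cost of silently dropping the earlier registration.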
cc: @keggsmurph21
Sharing some thoughts on how ultratrace2.model.files could be implemented/refactored...
The current approach is to have a class for each kind of file (alignment, image, and sound) containing loader implementations for the accepted file formats. In order to support a new file format for a given file kind, one needs to add a class definition to the respective class. For example, if I want to add support for AAC, I would need to go into sound.py and add a class definition under Sound.
I propose that we do something a bit different which allows you to physically separate file format implementations from the kind of file they're loading while keeping them logically related using inheritance. We would then maintain a registry which maps MIME type/extension to a loader implementation. The hierarchy would look something like:
FileLoaderBase[TFileChunk]
    AlignmentFileLoader
        TextGridFileLoader
    ImageFileLoader
        DICOMFileLoader
    SoundFileLoader
        WAVFileLoader
        FLACFileLoader
The base class is generic in TFileChunk because I suppose the end goal is to have an index operation on FileBundle which returns the file chunk at that index for each kind of file. This index operation returns a TFileChunk, which would be something like AlignmentChunk for alignment files, ImageChunk for image files, and SoundChunk for sound files.
We would then have registry.py which contains
This API might look something like
The final piece is dynamically instantiating the correct file loader given a file path. One would parse the file path for the extension, then use the get_ functions to retrieve the appropriate loader or throw an exception if the file format is not supported. Given the appropriate loader, an instance is constructed using the file path.
Implementing a new file format might look something like
This approach prevents 1k+ line files as the number of file format implementations grows, which the current approach cannot avoid. We can now have one file format implementation per file, which is much easier to grok IMO. It also opens up the possibility of exposing the registry API as a public API through which users who don't have control over this library can register their own file loader plugins.
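As a rough illustration of the proposed hierarchy and the index operation, here is a sketch of a base class that is generic in the chunk type. The `SoundChunk` fields, the `__getitem__` signature, and `FakeWAVFileLoader` are assumptions made for the example and are not the project's actual definitions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, List, TypeVar

TFileChunk = TypeVar("TFileChunk")


class FileLoaderBase(ABC, Generic[TFileChunk]):
    """Base for all loaders; an instance is constructed from a file path."""

    def __init__(self, path: str) -> None:
        self.path = path

    @abstractmethod
    def __getitem__(self, index: int) -> TFileChunk:
        """Return the chunk of this file at the given index."""


@dataclass
class SoundChunk:  # hypothetical chunk type for sound files
    samples: List[float]


class SoundFileLoader(FileLoaderBase[SoundChunk]):
    """Groups sound formats; still abstract because __getitem__ is not defined."""


class FakeWAVFileLoader(SoundFileLoader):
    """Hypothetical loader that fabricates samples instead of reading a file."""

    def __getitem__(self, index: int) -> SoundChunk:
        return SoundChunk(samples=[float(index)] * 4)


loader = FakeWAVFileLoader("recording.wav")
chunk = loader[2]
```

Each concrete format can live in its own module and subclass the loader for its kind of file, so a type checker knows that indexing any `SoundFileLoader` yields a `SoundChunk`.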