Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coalescing filenames intended behavior #124

Closed
keggsmurph21 opened this issue Dec 31, 2019 · 6 comments
Closed

Coalescing filenames intended behavior #124

keggsmurph21 opened this issue Dec 31, 2019 · 6 comments
Assignees
Labels
data formats question Further information is requested

Comments

@keggsmurph21
Copy link
Collaborator

Currently, we attempt to associate / coalesce files by their filename. In the normal case, that means we will see a directory such as

project_root/
    |- file00.dicom
    |- file00.TextGrid
    |- file00.wav

and determine that there is a file00 container with three associated files, which we will use to populate our widgets. Note: in the 1.0 architecture, this is basically equivalent to Metadata::getFileLevel, and in the 2.0 architecture, this is handled in FileBundleList::build_from_dir.

For the majority of cases, this simple "extract filename and grab anything with relevant extensions" approach works fine, but I think we have undefined behavior in at least the following cases. I'm hoping y'all have some ideas about how we should handle them.

There's some more related discussion here, here, and here.

1. Same name, different extension

project_root/
    |- file00.mp3
    |- file00.wav

It doesn't make any sense to have both mp3 and wav "versions" of a file loaded -- which would play through our interface?

I think the sensible thing in this case is to have some sort of priority assigned to each loader of a specific subtype. At register-time, we could enforce that there are no "ties" in priority. Then, we could just

register(MP3Loader, priority=1)
register(WAVLoader, priority=2)

or something to that effect. In this case, we should probably still emit warnings that we're "ignoring" a potentially valid file.

2. Same name, same extension, different path

project_root/
    |- subdir00/
    |   |- file00.mp3
    |- subdir01/
        |- file00.mp3

This may be a pathological example that we just don't want to support. Coalescing on only filename (and ignoring the full path) allows us to support arbitrarily nested directories. If we change this behavior to tracking (some subset of) the path, we lose that benefit and our whole model goes out the window.

Sensible behavior here might be to enforce deterministic behavior using sorting, and then just keep the first (or last?) match, again emitting a warning to the user.

3. Symlinks

Do we want to use the name of the link or the name of the target? To me, I think using the link name makes more sense, since we may not have write permissions on the target (and thus not be able to change its name).

I'm going to submit a PR that defines some of this behavior shortly, but I figured this would be a good place for any discussion.

@keggsmurph21 keggsmurph21 added question Further information is requested data formats labels Dec 31, 2019
keggsmurph21 added a commit that referenced this issue Dec 31, 2019
Priority is per-(direct subclass of)FileLoaderBase -- i.e., each
concrete subclass of, say, SoundFileLoader must have a unique priority.

See #124 for more context :^)
@jonorthwash
Copy link
Collaborator

1. Same name, different extension

...

I think the sensible thing in this case is to have some sort of priority assigned to each loader of a specific subtype. At register-time, we could enforce that there are no "ties" in priority. Then, we could just

register(MP3Loader, priority=1)
register(WAVLoader, priority=2)

or something to that effect. In this case, we should probably still emit warnings that we're "ignoring" a potentially valid file.

This sounds exactly right. And imo the priority ranking should be something like flac > wav > others.

2. Same name, same extension, different path

project_root/
    |- subdir00/
    |   |- file00.mp3
    |- subdir01/
        |- file00.mp3

This may be a pathological example that we just don't want to support. Coalescing on only filename (and ignoring the full path) allows us to support arbitrarily nested directories. If we change this behavior to tracking (some subset of) the path, we lose that benefit and our whole model goes out the window.

Sensible behavior here might be to enforce deterministic behavior using sorting, and then just keep the first (or last?) match, again emitting a warning to the user.

I could see people having their files organised like that. Maybe something like

project_root/
    |- recording00/
    |   |- ultrasound.dcm
    |   |- audio.wav
    |- recording01/
        |- ultrasound.dcm
        |- audio.wav

Perhaps we could find a way to support this structure. In this case, recording00 and recording01 become the "file" names (that currently appear in the dropdown).

3. Symlinks

Do we want to use the name of the link or the name of the target? To me, I think using the link name makes more sense, since we may not have write permissions on the target (and thus not be able to change its name).

Yes, name of the link. Something I often do is symlink a bunch of non-correspondingly named things into a single directory or set of directories.

@keggsmurph21
Copy link
Collaborator Author

This is all implemented in #125.

This sounds exactly right. And imo the priority ranking should be something like flac > wav > others.

Currently, the SoundFileLoader priorities are

  • flac : 4
  • wav: 3
  • ogg: 2
  • mp3: 1

I could see people having their files organised like that. Maybe something like

project_root/
    |- recording00/
    |   |- ultrasound.dcm
    |   |- audio.wav
    |- recording01/
        |- ultrasound.dcm
        |- audio.wav

Perhaps we could find a way to support this structure. In this case, recording00 and recording01 become the "file" names (that currently appear in the dropdown).

I agree that this would make sense to support, although I think it's fairly difficult to automatically guess. For example, we would also theoretically want to support

project_root/
    |- recording00/
    |   |- raw/
    |      |- ultrasound.dcm
    |      |- audio.wav
    |- recording01/
        |- raw/
            |- ultrasound.dcm
            |- audio.wav

To me, this feels like a problem we should solve later.

Yes, name of the link. Something I often do is symlink a bunch of non-correspondingly named things into a single directory or set of directories.

👍

@calebho
Copy link
Collaborator

calebho commented Jan 2, 2020

Currently, the SoundFileLoader priorities are

  • flac : 4
  • wav: 3
  • ogg: 2
  • mp3: 1

Assigning integer priorities at register time seems a bit cumbersome and error prone. If you want to add a new file loader, you need to make O(# loaders) changes in the worst case.

What if we have a file which specifies the priorities as a list L where loader L[i] has higher priority over loader L[j] iff i < j? This way you have priority information in one place as opposed to scattered across many files. Then given a situation where you have file.ext1 and file.ext2, loader resolution would be a matter of iterating through this priority list and returning the first loader whose supported extensions is in the set {'.ext1', '.ext2'}.

For the same name different path case, I think it would make sense to prompt the user to make a choice in the UI since it's not obvious to me what a good default is.

@jonorthwash
Copy link
Collaborator

For the same name different path case, I think it would make sense to prompt the user to make a choice in the UI since it's not obvious to me what a good default is.

I think there's probably a way to do this automatically.

@keggsmurph21
Copy link
Collaborator Author

Currently, the SoundFileLoader priorities are

  • flac : 4
  • wav: 3
  • ogg: 2
  • mp3: 1

Assigning integer priorities at register time seems a bit cumbersome and error prone. If you want to add a new file loader, you need to make O(# loaders) changes in the worst case.

What if we have a file which specifies the priorities as a list L where loader L[i] has higher priority over loader L[j] iff i < j? This way you have priority information in one place as opposed to scattered across many files. Then given a situation where you have file.ext1 and file.ext2, loader resolution would be a matter of iterating through this priority list and returning the first loader whose supported extensions is in the set {'.ext1', '.ext2'}.

How would you handle user-defined loaders then? By just always giving them the highest priority and preventing them from registering conflicting loaders?

@jonorthwash
Copy link
Collaborator

This issue has been discussed. Anything further that's related can be brought up in new issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data formats question Further information is requested
Projects
None yet
Development

No branches or pull requests

5 participants