Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent errors from identifier functions for loaders #404

Merged
merged 1 commit into from
Feb 4, 2019

Conversation

hamogu
Copy link
Member

@hamogu hamogu commented Nov 27, 2018

A well-written identifier function will return True or False.
However, the existing identifier functions often make assumptions
about the file which may be wrong, e.g. they check the fits header
keyword "TELESCOP" and fail with a KeyError when that keyword does
not exist or the file is not valid fits.
The correct return would be False, as in "this is not a file
I can read", which then allows the registry to try the next format.

This fixes the problem for the existing default reader and adds a
test.

@nmearl
Copy link
Contributor

nmearl commented Nov 27, 2018

Fair point! But ambiguous try/excepts can be dangerous, if we know that they are there to catch improper fits header keywords, we should aim to either test directly for that case (just as you test whether 'TELESCOPE' exists in the header), or explicitly catch the error. I think it's in our interests to know when an identifier function is incorrect due to some unforeseen issue.

@hamogu
Copy link
Member Author

hamogu commented Nov 27, 2018

I disagree. Identifier functions should be tested to make sure they work, but they should work with whatever users throw at them and not raise Errors. Why would I even have an identifier function, if I can't use Spectrum1D.read(...) on an arbitrary file?
Any valid file should give True and any invalid file should give False. At that point the user should not have to worry about why it failed - the point is that the reader can't read it.
As a developer, I want all the errors in detail, but as a user, I want Spectrum1D.read(...) to work and not to fail in identify_spec_type_I_dont_care_about some identifier function that's totally unrelated to the data I have.

@hamogu
Copy link
Member Author

hamogu commented Nov 27, 2018

I think it's in our interests to know when an identifier function is incorrect due to some unforeseen issue.

That's in the developer's interest, but it's totally a show-stopper for a user. Developers can get around that by testing the identifier, see #405.

I came across this problem like this: I got a fits file from an optical observer. I don't know the exact format, so I tried spec = Spectrum1D.read(...) which fails with a a KeyError in the STIS reader. I happen to be involved in astropy and previous specutils programming, so I started to dig; but as a normal user I would have said "Mmm, seems like I can't use this unless I know the exact format.". So, I fixed the STIS reader problem with the keyword and tried again. Failure, because KeyError in the HST/COS file. Ok, so I fix that reader, too.

Then, I tried to read another spectrum, that someone had given me as a text file. So, I tried again spec = Spectrum1D.read(...), knowing that I just fixed the KeyErrors. Failure. "Not a valid fits file". Mmm, I know it's a text file, so that's OK, but seems like I can't use this package unless I know the exact format...

So, I enclosed all readers assuming fits format in try..except blocks. That's the PR you see here.

So, I strongly recommend to write identifiers that return True or False and don't break the chain of trying out the formats because someone uses them on a file format that we did not anticipate.

@nmearl
Copy link
Contributor

nmearl commented Nov 28, 2018

I still disagree with this approach. If the identifier function is unable to handle the file type gracefully, it should absolutely raise an error instead of obfuscating the reason why the registry didn't choose that loader.

Why would I even have an identifier function, if I can't use Spectrum1D.read(...) on an arbitrary file?

My argument is that arbitrary files should be a handled case, and not swallowed by simple all-encompassing exception. The point of the identifier function is to act as a filter, and a filter is only useful if the both the user and developer can determine its efficacy, otherwise it's impossible to know whether a loader is chosen or not due to the data actually not being a match, or some malformed header that blocks the appropriate loader from being selected.

I am agreeing that the try/except is decent means of handling this, I'm simply pointing out that the exceptions should be qualified and handled appropriately. If a file is a fits file and was generated from HST, and specifically from COS or STIS, and yet still fails, the user should be made aware that the appropriate loader is not being selected due to some issue with the data. This is the information that is being swallowed when implementing this type of try/except anti-pattern.

Your examples above would benefit from simply having more selective identifier functions, not necessarily blanket exception handling.

@hamogu
Copy link
Member Author

hamogu commented Nov 28, 2018

What valid reason do you imagine to have a file that "is a fits file and was generated from HST, and specifically from COS or STIS, and yet still fails"?
Also, if you already know what you are dealing with, you can set the format=... and your file will read (EDIT: or throw and error that will tell you what's wrong with it).

I suggest that we solicit more opinions on the strategy. If @nmearl view prevails, I'm happy to make this much narrower and only catch the errors that I specifically can produce now.

However, I can guarantee that this issue will return with files that fail e.g. the check if it's a fits file in some other way. Potentially, this can be a continuing source of frustration for users - unless we catch essentially all exceptions.

cc: @sargas @eteq @stscicrawford

@nmearl
Copy link
Contributor

nmearl commented Nov 28, 2018

What valid reason do you imagine to have a file that "is a fits file and was generated from HST, and specifically from COS or STIS, and yet still fails"?

There is no valid reason to have a file that one knows is of a particular flavor, and yet still fails. The point being that, in your case, the identifier would skip the appropriate loader instead of throwing an error. Either way, the user is left without a properly loaded data file. The difference is whether or not they know why it could not be parsed.

@nmearl
Copy link
Contributor

nmearl commented Nov 28, 2018

The solution may be as simple as being sure to log the error in the except block, not necessarily raising an exception. But if the true experience were to never raise errors in the identifier functions, then that implementation detail would've been codified upstream in the registry itself. I'm of the mind that catch-all anti-patterns should not be used when we certainly have the ability to be more selective.

@hamogu
Copy link
Member Author

hamogu commented Nov 28, 2018

I think logging any error caught is a good solution and I'll see how to do that. If you happen to know offhand how to write to the logger, I appreciate a small note!

But if the true experience were to never raise errors in the identifier functions, then that implementation detail would've been codified upstream in the registry itself.

Good idea. I'll open an issue in astropy itself.

@hamogu
Copy link
Member Author

hamogu commented Feb 1, 2019

@nmearl I rebased and added full exception logging to all catch-all try-except statements that I introduced.

@nmearl
Copy link
Contributor

nmearl commented Feb 1, 2019

The more I look at this, the more I think the try/except behavior should be moved to the decorator definition. That way, the user does not need to be responsible for wrapping their identifier function in a try/except block and log the failure, it would be automatically done when the loader and identifier function are added to the registry.

I would suggest that in the data loader definition, place your try/except behavior in a nested fuction and wrap the passed in identifier function:

def data_loader(label, identifier=None, dtype=Spectrum1D, extensions=None,
                priority=0):
    """
    ...
    """
    def identifier_wrapper(ident):
        def wrapper(*args, **kwargs):
             try:
                 return ident(*args, **kwargs)
             except Exception as e:
                 logging.debug("Tried to read this as {} file, but could not.".format(label))
                 logging.debug(e, exc_info=True)
                 return False
        return wrapper

    identifier = identifier_wrapper(identifier)

    def decorator(func):
        io_registry.register_reader(label, dtype, func)
        ...

@hamogu
Copy link
Member Author

hamogu commented Feb 1, 2019

Yes, a decorator sounds good for this case. In fact, once it's done here and works, I think I'll contribute that decorator to astropy, too. However, the fact that this problem has never come up in several years of astropy means that it is not urgent there. I think specutils is the first file that makes extensive use of identifier functions. Most other packages (including astropy itself) only look at the file extension, which does not raise exception.

If accepted into astropy, it could then be removed here again. However, I definitely want this in photutils now and not wait for an entire astropy development cycle plus the time until photutils requirements match the most recent version of astropy because I tried reading several different fits files form different sources and they all failed, so I think that this is probably a common problem.

I will turn this into a decorator some time this weekend.

@nmearl
Copy link
Contributor

nmearl commented Feb 1, 2019

For now, I'm suggesting that this can already be handled in the already existing decorator that is responsible for adding the loader and the identifier function to the registry. That decorator currently exists in specutils, so you would only need to add the functionality to wrap the identifier function (the little code snippet I offered). Certainly though, in the future, the decorator approach could be moved to astropy proper.

However, the existing identifier functions often make assumptions
about the file which may be wrong, e.g. they check the fits header
keyword "TELESCOP" and fail with a KeyError when that keyword does
not exist or the file is not valid fits.
The correct return would be False, as in "this is not a file
I can read", which then allows the registry to try the next format.

This fixes the problem for the existing default reader and adds a
test.
Copy link
Contributor

@nmearl nmearl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thanks @hamogu.

@nmearl nmearl merged commit 91e2918 into astropy:master Feb 4, 2019
@hamogu hamogu deleted the nofailstis branch May 2, 2019 19:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants