add __hash__ method to FlyteFile to fix bug during interactive mode #2853

Merged (3 commits) on Oct 24, 2024

Conversation

granthamtaylor
Contributor

This task fails only in a notebook. It does not fail when run locally as a Python script, run locally via pyflyte, or run remotely.

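import flytekit as fk

@fk.task
def write_file(message: str) -> fk.FlyteFile:
    ff = fk.FlyteFile(path='myfile.txt')

    # open() accepts the FlyteFile because it is path-like (it defines __fspath__).
    with open(ff, mode="w") as file:
        file.write(message)

    return ff

# Calling the task directly in a notebook cell raises the error below.
write_file(message='hello world')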
> TypeError: Error encountered while executing '968396757.write_file':
  unhashable type: 'FlyteFile'

This can be fixed by adding a __hash__ method to FlyteFile. I am computing the hash from the serialized representation of the FlyteFile.
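A minimal sketch of that idea, using a hypothetical stand-in class rather than flytekit's actual FlyteFile. The class name `FileRef`, its single `path` field, and the `serialize` helper are assumptions made for illustration; the merged change may build the serialized form differently.

```python
import json
import os
from dataclasses import dataclass


@dataclass
class FileRef(os.PathLike):
    """Hypothetical stand-in for a path-like dataclass; not flytekit's FlyteFile."""

    path: str

    def __fspath__(self) -> str:
        # Makes the object usable wherever a filesystem path is expected.
        return self.path

    def serialize(self) -> str:
        # Assumed serialized form: a JSON string of the path field.
        return json.dumps({"path": self.path}, sort_keys=True)

    def __hash__(self) -> int:
        # Hash the serialized representation, so objects that compare
        # equal (same path) also hash equally.
        return hash(self.serialize())


ref = FileRef("myfile.txt")
is_std_stream = ref in {0, 1, 2}  # hashing now succeeds; the result is False
```

Because the explicit `__hash__` is defined in the class body, `@dataclass` leaves it in place instead of setting `__hash__` to `None`.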

@granthamtaylor
Contributor Author

The root cause is in IPython's interactiveshell.py (the shell behind Jupyter notebooks), which checks whether the file being opened is one of stdin, stdout, or stderr. During this check, the file is tested for membership in a set (or dictionary keys) containing these standard I/O objects.

Because sets (and dictionary keys) rely on hashing for membership checks, the shell attempts to hash the object passed to open(). FlyteFile is not hashable, so the check raises a TypeError.
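The failure mode can be reproduced outside a notebook with a small sketch. It assumes, for illustration, a dataclass with the default `eq=True`, which sets `__hash__` to `None`. The names `Unhashable` and `guarded_check` are hypothetical, and `guarded_check` only mirrors the shape of the check described above; it is not IPython's code.

```python
from dataclasses import dataclass


@dataclass
class Unhashable:
    """Hypothetical path-like value; the default eq=True makes it unhashable."""

    path: str


def guarded_check(file) -> bool:
    # Set membership hashes `file` first, so an unhashable argument raises
    # TypeError before any comparison happens.
    return file in {0, 1, 2}


message = None
try:
    guarded_check(Unhashable("myfile.txt"))
except TypeError as exc:
    # The message names the unhashable type.
    message = str(exc)
```

Plain strings and integers pass through the same check without error, which is consistent with the bug appearing only when a FlyteFile itself is handed to open().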

@granthamtaylor enabled auto-merge (squash) October 23, 2024 16:27

codecov bot commented Oct 23, 2024

Codecov Report

Attention: Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 38.27%. Comparing base (3fc51af) to head (b63e9c6).
Report is 6 commits behind head on master.

Files with missing lines       Patch %   Lines
flytekit/types/file/file.py    50.00%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2853      +/-   ##
==========================================
- Coverage   45.53%   38.27%   -7.26%     
==========================================
  Files         196      196              
  Lines       20418    20473      +55     
  Branches     2647     2650       +3     
==========================================
- Hits         9298     7837    -1461     
- Misses      10658    12427    +1769     
+ Partials      462      209     -253     


@granthamtaylor merged commit 8017ae8 into master Oct 24, 2024
105 of 106 checks passed
@wild-endeavor
Contributor

@Mecoli1219 did some digging around this and found that Jupyter checks whether the file is stdin, stdout, or stderr.

I think this is fine, but in the interest of rigor, I don't see any harm in including all the fields of the object (excluding the downloader function but including the downloaded bit) in the hash computation. What do you think, @Mecoli1219?

Also, this change is not needed for the FlyteDirectory type, right? That also has a __fspath__ method.

@Mecoli1219
Contributor

I am not really sure, but I don't think there is a use case in which two FlyteFiles in the same process have the same path but different downloader functions. If one exists, that would be really weird. Say we create two FlyteFiles:

f1 = FlyteFile("test.txt", downloader=...)
f2 = FlyteFile("test.txt", downloader=...)

The result would then depend on the order in which f1.download() and f2.download() are executed.

If that is right, two FlyteFiles with the same path would always have the same downloader function, which means they should be treated as the same FlyteFile. However, the downloaded bits can differ between the two instances (for example, one has called download() and the other has not). If the hash function took the downloaded bit into account, the two files would probably no longer be the same after hashing.

So I don't think I would include the downloaded bit here, but I am not entirely familiar with how FlyteFile is used; this is just my current thinking.
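The concern about the downloaded bit can be sketched with a hypothetical class whose equality and hash both include a mutable flag. The class `WithFlagInHash` and its `download` stub are assumptions for illustration and do not reflect FlyteFile's real fields or behavior.

```python
class WithFlagInHash:
    """Hypothetical file reference whose hash includes a mutable flag."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.downloaded = False

    def download(self) -> None:
        # Stand-in for a real download; it only flips the flag.
        self.downloaded = True

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, WithFlagInHash):
            return NotImplemented
        return (self.path, self.downloaded) == (other.path, other.downloaded)

    def __hash__(self) -> int:
        return hash((self.path, self.downloaded))


f1 = WithFlagInHash("test.txt")
f2 = WithFlagInHash("test.txt")
same_before = f1 == f2 and hash(f1) == hash(f2)  # identical state so far

f1.download()            # only f1 flips its flag
same_after = f1 == f2    # the two references now compare unequal
found = f1 in {f2}       # and f1 is no longer found in a set holding f2
```

Two references to the same path stop matching as soon as one of them is downloaded, which is the behavior the comment above argues against.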

@granthamtaylor
Contributor Author

Also this change is not needed for the FlyteDirectory type right? That also has a __fspath__ function.

I tried a number of operations and I couldn't produce any unexpected behavior.

Yes, you are right, I should have used a more comprehensive hash function.
