Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NFC Explicitly close file descriptors in emcc.py #14074

Merged
merged 18 commits into from
Jun 16, 2021

Conversation

rth
Copy link
Contributor

@rth rth commented May 2, 2021

This explicitly closes file descriptors in emcc.py.

So instead of doing,

out = open(file_path).read()

which relies on the CPython garbage collection to eventually close the file descriptor, one can do,

from pathlib import Path

out = Path(file_path).read_text()

and similar pathlib methods, to close the file descriptor internally which is generally recommended (cf https://stackoverflow.com/q/7395542/1791279)

Not closing file descriptors can in particular lead to subtle issues of not knowing when exactly the written data is going to be flushed on disk, and it's not going to work well with other Python implementations (e.g. PyPy) which don't use reference counting.

@sbc100
Copy link
Collaborator

sbc100 commented May 2, 2021

Great! Thanks for working on this.

We've been moving away from using raw open's for a while now and these days we have have quite a lot of usage of the with open as f: method throughout the toolchain. As a followup change perhaps we should go back and have them use Path.. too?

Actually I have 3 different question/thoughts:

  1. We should probably apply this change to the entire codebase not just emcc.py?
  2. Should we have read_text/write_text/read_binary/write_binary helpers to shared.py to avoid having to do from pathlib import Path everywhere and give is a single point of injection for all file reading and writing in the toolchain? (this would allow us to add logging for example).
  3. Should we also change all the places where we do with open(..) as f: followed by single call to read or write? I vote yes, but maybe as followup?

I suggest we do (1) and (2) as part of this PR (I could be persuaded that we don't need (2) .. but I kind of like that idea) with (3) as a followup?

(We normally mark our non-functional changes with the NFC suffix rather than that MAINT prefix.. just because this matches what llvm does, not because its better necessarily).

@rth rth changed the title MAINT Explicitly close file descriptors in emcc.py NFC Explicitly close file descriptors in emcc.py May 2, 2021
@rth
Copy link
Contributor Author

rth commented May 3, 2021

We should probably apply this change to the entire codebase not just emcc.py?

Yeah, I can apply it to other files, once there is an agreed upon solution. I started small since I wasn't sure there would be interest in this change.

Should we have read_text/write_text/read_binary/write_binary helpers to shared.py to avoid having to do from pathlib import Path everywhere and give is a single point of injection for all file reading and writing in the toolchain?

IMO the ideal case for new code that does path manipulations is to use Path objects directly, that way you could do file_path.read_text() directly without the need of helpers. A blog on that subject.

For an existing large code base as in emscripten, though I'm not certain doing such migration is worth the effort. There is always some risk of bugs introduced by such migration and one has to rely on test coverage being high and/or some type annotations.

More generally I think I would still prefer to write Path(...).read_text() (or open(...).read()) rather than read_text(...) because Path is becoming a fairly common stdlib object in Python 3 code bases and it's easier to understand what it does (or be aware of other methods it has) than a custom function that adds another layer of abstraction. If some use of Path objects is accepted in the future, it would also make it easier to interact with code around it.

For instance, something like

if not os.path.exists(file_path):
   # handle it

return open(os.path.join(file_path, 'bin')).read()

could be written as,

if not file_path.exits():
   # handle it

return (file_path / 'bin').read_text()

assuming the file_path was converted to Path in the beginning.

Should we also change all the places where we do with open(..) as f: followed by single call to read or write? I vote yes, but maybe as followup?

TBH, I saw people using Path.read_text only recently, so maybe I'm missing some potential issues, but otherwise it does sound like a good idea (for a follow up PR).

@sbc100
Copy link
Collaborator

sbc100 commented May 3, 2021

I hear you points about transitioning to Path everwhere and maybe that should be goal. Even if that is a goal I still think is nice as an interim to have helper that take strings.

i.e. in sharedpy:

def read_text(filename):
   Path(filename).read_text()

This would be for use by any code that still uses str rather than Path and would

  1. make it obvious which code has yet to be converted to Path
  2. avoid importing Path into code that hasn't already been coverted.

It also adds a nice interception point, but that applies even after a potential conversion to Path everywhere.

I'm not sure I'm quite ready for file_path / 'bin' but maybe I'm just too old fashioned :) I'll come around to it I'm sure.

@rth
Copy link
Contributor Author

rth commented May 3, 2021

OK, that works, I'll add helper functions for it.

sbc100 added a commit that referenced this pull request May 13, 2021
There are plans underway to start using pathlib more in emscripten.
See: #14074

This change a designed to test the water by using `pathlib` in test
code.

This allows us to use unix-like paths throughout the tests that will
get converted to windows-paths on windows machines automatically
by the pathlib library.
@sbc100
Copy link
Collaborator

sbc100 commented May 13, 2021

I keep a stab and using pathlib in out testing code: #14175

Are you planning on updating this PR?

sbc100 added a commit that referenced this pull request May 13, 2021
There are plans underway to start using pathlib more in emscripten.
See: #14074

This change a designed to test the water by using `pathlib` in test
code.

This allows us to use unix-like paths throughout the tests that will
get converted to windows-paths on windows machines automatically
by the pathlib library.
sbc100 added a commit that referenced this pull request May 13, 2021
There are plans underway to start using pathlib more in emscripten.
See: #14074

This change a designed to test the water by using `pathlib` in test
code.

This allows us to use unix-like paths throughout the tests that will
get converted to windows-paths on windows machines automatically
by the pathlib library.
sbc100 added a commit that referenced this pull request May 13, 2021
There are plans underway to start using pathlib more in emscripten.
See: #14074

This change a designed to test the water by using `pathlib` in test
code.

This allows us to use unix-like paths throughout the tests that will
get converted to windows-paths on windows machines automatically
by the pathlib library.
sbc100 added a commit that referenced this pull request May 13, 2021
There are plans underway to start using pathlib more in emscripten.
See: #14074

This change a designed to test the water by using `pathlib` in test
code.

This allows us to use unix-like paths throughout the tests that will
get converted to windows-paths on windows machines automatically
by the pathlib library.
sbc100 added a commit that referenced this pull request May 13, 2021
There are plans underway to start using pathlib more in emscripten.
See: #14074

This change a designed to test the water by using `pathlib` in test
code.

This allows us to use unix-like paths throughout the tests that will
get converted to windows-paths on windows machines automatically
by the pathlib library.
sbc100 added a commit that referenced this pull request May 13, 2021
There are plans underway to start using pathlib more in emscripten.
See: #14074

This change a designed to test the water by using `pathlib` in test
code.

This allows us to use unix-like paths throughout the tests that will
get converted to windows-paths on windows machines automatically
by the pathlib library.
sbc100 added a commit that referenced this pull request May 13, 2021
There are plans underway to start using pathlib more in emscripten.
See: #14074

This change a designed to test the water by using `pathlib` in test
code.

This allows us to use unix-like paths throughout the tests that will
get converted to windows-paths on windows machines automatically
by the pathlib library.
sbc100 added a commit that referenced this pull request May 14, 2021
There are plans underway to start using pathlib more in emscripten.
See: #14074

This change a designed to test the water by using `pathlib` in test
code.

This allows us to use unix-like paths throughout the tests that will
get converted to windows-paths on windows machines automatically
by the pathlib library.
@rth
Copy link
Contributor Author

rth commented May 14, 2021

Great! Yes, I'll update it in the next few days.

sbc100 added a commit that referenced this pull request May 14, 2021
There are plans underway to start using pathlib more in emscripten.
See: #14074

This change a designed to test the water by using `pathlib` in test
code.

This allows us to use unix-like paths throughout the tests that will
get converted to windows-paths on windows machines automatically
by the pathlib library.
sbc100 added a commit that referenced this pull request May 14, 2021
There are plans underway to start using pathlib more in emscripten.
See: #14074

This change a designed to test the water by using `pathlib` in test
code.

This allows us to use unix-like paths throughout the tests that will
get converted to windows-paths on windows machines automatically
by the pathlib library.
@sbc100
Copy link
Collaborator

sbc100 commented Jun 12, 2021

Are you still interested in updating this PR?

@rth
Copy link
Contributor Author

rth commented Jun 12, 2021

Thanks for your patience. I am interested, just with limited availability.

I made the discussed changes and applied them to all the files.

  • half the way through I realized that you already defined such helper functions in
    def read_file(*path_components):
    . I'm not sure where it would make sense to keep them in the end.
  • because tools is not an installable Python package and one has to use sys.path hacks to be able to use shared.py. Often I find it simpler and more reliable to use the open context manager directly rather than apply fixes to be able to import shared.py. It also avoids to add coupling between modules that are independent now. Similarly in other places I used the open context manager to avoid circular import issues with shared.py

Working on fixing CI.

emrun.py Show resolved Hide resolved
Copy link
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good.

My main feedback (duplicated in comments):

  1. Can we keep the read_text read_binary names?
  2. Avoid adding new imports of shared.py
  3. Always prefer Path(..).read() over context manager (to avoid extra lines and indentation)?

emrun.py Show resolved Hide resolved
site/source/conf.py Show resolved Hide resolved
site/source/get_wiki.py Show resolved Hide resolved
tests/test_core.py Outdated Show resolved Hide resolved
@@ -106,7 +106,8 @@ def parse_config_file():
Also check EM_<KEY> environment variables to override specific config keys.
"""
config = {}
config_text = open(EM_CONFIG, 'r').read()
with open(EM_CONFIG) as fh:
config_text = fh.read()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you put the helper functions inutils.py you can use them from here.


sys.path.insert(1, str(Path(__file__).parents[2].resolve()))

from tools import shared
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree I don't think its worth added a new dependency on shared to get these utils. Just use Path.read here maybe?


sys.path.insert(1, __rootpath__)

from tools import shared
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

exec(open(path_from_root('tools', 'shared.py'), 'r').read())

with open(path_from_root('tools', 'shared.py'), 'r') as fh:
exec(fh.read())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Path.read?

@sbc100
Copy link
Collaborator

sbc100 commented Jun 12, 2021

Appreciate you taking the time work on this

rth added 2 commits June 13, 2021 08:25
and read_bytes -> read_binary
@@ -95,6 +94,8 @@ def make_fake_llc(filename, targets):

SANITY_MESSAGE = 'Emscripten: Running sanity checks'

EMBUILDER = path_from_root('embuilder.py')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean move this definition out of runner.py?

@@ -24,7 +23,7 @@
from tools import response_file

SANITY_FILE = shared.Cache.get_path('sanity.txt')
commands = [[EMCC], [path_from_root('tests/runner'), 'blahblah']]
commands = [[EMCC], [path_from_root('tests', 'runner'), 'blahblah']]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to do revert that change to use "/" in pathsnames? Perhaps a bad merge?


__rootpath__ = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

sys.path.insert(1, str(Path(__file__).parents[1].resolve()))
from tools.shared import read_file
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might not be worth added the new dependency on shared.py here?

tools/shared.py Outdated
def read_binary(file_path):
"""Read from a file opened in binary mode"""
with open(file_path, 'rb') as fh:
text = fh.read()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just return fh.read() here? (and above)

@rth rth marked this pull request as draft June 14, 2021 21:42
@rth
Copy link
Contributor Author

rth commented Jun 14, 2021

Yes sorry I should have switched to a Draft status -- the changes weren't complete and there were indeed some merge issues. Should be OK now.

I think I addressed all comments, except for keeping the helpers in shared.py instead of utils.py as I didn't want to change all imports again to save two call to open in tools/config.py. I also reverted all changes to tests/: those have their own helpers, and could be done in a separate PR. Similarly these two helpers could be merged into 1 under some location in a follow up PR if useful, though their accepted parameters are not exactly the same.

@rth rth marked this pull request as ready for review June 14, 2021 22:27
@rth
Copy link
Contributor Author

rth commented Jun 14, 2021

It looks like there are still some CI failures. I'll investigate tomorrow.

Copy link
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few more comments but lgtm

@@ -55,7 +56,7 @@ def getExtendedAttribute(self, name): # noqa: U100
p.parse(r'''
interface VoidPtr {
};
''' + open(input_file).read())
''' + read_file(input_file))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just shared.read_file to avoid that extra import above?

@@ -176,7 +178,7 @@ def remove_dead_entries(entries):

def read_dwarf_entries(wasm, options):
if options.dwarfdump_output:
output = open(options.dwarfdump_output, 'rb').read()
output = read_binary(options.dwarfdump_output)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use Path directly here to keep this tool free to deps on the rest of emscripten?

@@ -28,7 +29,7 @@ def create(final):
shutil.rmtree(dest_path, ignore_errors=True)
shutil.copytree(source_path, dest_path)

open(os.path.join(dest_path, 'include', 'ogg', 'config_types.h'), 'w').write(config_types_h)
Path(dest_path, 'include', 'ogg', 'config_types.h').write_text(config_types_h)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have access to shared here and in the other ports files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I have reached my limit for manual and tedious changes between Path.read_text open().read() and shared.read_file for this PR :) I understand not using Path would save 1 import here, but in the end it is a stdlib module, it's readable and has no other undesirable side effects. Please feel free to push the change if that's a blocker.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also with read_file I would have to either revert back to using os.path.join or change the `read_binary signature to accept path segments, and the latter gives me the feeling that this PR is never going to end :)

def write_file(file_path, text):
"""Write to a file opened in text mode"""
with open(file_path, 'w') as fh:
fh.write(text)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason you chose to use context managers here rather than Path? I guess its slightly faster because it doesn't have to do the normalization stuff?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well no particular reason not to use open either. It's still the canonical way to open files. Path is shorter indeed, but open would be easier to adapt to specify the size parameter, for instance in read, which is used in a few paces of the code base.

@sbc100
Copy link
Collaborator

sbc100 commented Jun 15, 2021

OK I think we can merge this and go from there. Thanks for your work!

@rth
Copy link
Contributor Author

rth commented Jun 15, 2021

Thanks for the review @sbc100 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants