Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Change 'readdir' to return an iterator. #27450

Closed
wants to merge 4 commits into from

Conversation

malmaud
Copy link
Contributor

@malmaud malmaud commented Jun 6, 2018

cf #27393

Copy link
Member

@StefanKarpinski StefanKarpinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from iterator name bikeshedding, looks good to me.

base/file.jl Outdated
@@ -572,14 +572,41 @@ struct uv_dirent_t
typ::Cint
end

struct ReadDirIter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unnecessarily abbreviated. How about ReadDirIterator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

@vtjnash
Copy link
Member

vtjnash commented Jun 6, 2018

It would be simpler (and faster and more memory efficient) to implement this as:

lazy_readdir(name) = (path for path in readdir(name))

Just my 2¢

@malmaud
Copy link
Contributor Author

malmaud commented Jun 6, 2018 via email

@vtjnash
Copy link
Member

vtjnash commented Jun 6, 2018

They are already all loaded into memory, this PR would just make accessing them slower.

Support for the lazy version was proposed a few years ago, so it might land soon(ish) in libuv master: libuv/libuv#416

@StefanKarpinski
Copy link
Member

I guess the question for us at this point is: do we want to expose an iterator interface to readdir even though the current libuv implementation is always eager in anticipation of libuv soon having a truly lazy implementation for which an iterator interface can be advantageous. I would say yes. The choice of implementation doesn't matter much and @malmaud's one here has the advantage of possibly not needing to change when libuv changes theirs.

@malmaud
Copy link
Contributor Author

malmaud commented Jun 6, 2018 via email

@StefanKarpinski
Copy link
Member

It's hitting the file system. Speed doesn't matter.

@JeffBezanson
Copy link
Member

JeffBezanson commented Jun 6, 2018

I don't see why an iterator is better. Arrays support more operations, e.g. pmap. Directories hardly ever (never?) have so many entries that memory use would be prohibitive.

@StefanKarpinski
Copy link
Member

The fact that it's an issue for libuv would seem to indicate that there are cases where it matters. Otherwise they would presumably not be bothering to change this.

@malmaud
Copy link
Contributor Author

malmaud commented Jun 7, 2018

Yes, I agree that the fact that most other libraries already do this lazily or have open issues from people running into trouble from it not being lazy should tell us something.

I think there's no practical limit in Linux anymore on how many files can be in a directory. Thus someone, somewhere, will run into performance issues that we can never remedy once we've committed to an API where readdir returns a fully reified list.

We could potentially just add another function like Jameson's lazy_readdir or whatever, if people want to keep readdir as it is.

@JeffBezanson
Copy link
Member

Ok, I'm convinced, we can make it an iterator.

@malmaud
Copy link
Contributor Author

malmaud commented Jun 7, 2018

Big mistake.

@StefanKarpinski
Copy link
Member

Note that to address the "double file system call" issue in nodejs/node#15699 we would need to return some kind of FileNode object that acts like a file path but caches stat information (which it already has from the readdir operation). It's unclear to me how relevant the "if we list large directories the whole process hangs" issue in nodejs/node#15699 is to Julia—after all, we can make readdir internally non-blocking and let other tasks do stuff while readdir collects directory names. Blocking isn't the huge problem in Julia that it is in JavaScript.

@StefanKarpinski
Copy link
Member

OTOH, I think it's just good practice to have streaming APIs to things that can be done incrementally. That will guide people naturally towards writing more scalable code for this sort of thing.

@StefanKarpinski
Copy link
Member

Bump, it's now or never. This is unlikely to break too terribly many usages since the most common usage is simply to iterate through the output of readdir...

@vtjnash
Copy link
Member

vtjnash commented Jul 10, 2018

It's still not clear to me how often you would need to use this function on directories with more than tens of millions of entries, such that you would want to slightly increase the cost and complexity of using readdir.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Jul 10, 2018

If that's the call, ok. If it becomes an issue in the future, we could have readdir(dir, lazy=true) or something like it in the future.

@malmaud
Copy link
Contributor Author

malmaud commented Jul 11, 2018

Alright, shall we just do Stefan's suggestion then (presumably a type-stable variant) at some point then?

@malmaud malmaud closed this Jul 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants