Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat/issue-306-pr #965

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

zamu-flowerpot
Copy link

This would resolve issue #306 with entry counts for directories.
This also resolves issue #964 as I had to define how to compare EntrySize::EntryCount and EntrySize::Bytes.

This adds a single dependency walkdir to handle counting the number of items in a directory cross-platform. Also, by default it doesn't follow symbolic links which seems fine.

`src/listing.rs`
===

* Add `EntrySize` struct
* Impl Display for EntrySize
* Change `Option<bytesize::ByteSize>` to `EntrySize` in `Entry`
* Add Entry::Directory size calculation utilizing WalkDir
* Sort directory listings by always putting directories at the top and
  sorting files and directories based on their own semantics (ie. entry
  counts for directories and bytes for files)

`src/renderer.rs`
===

* Change `entry_raw` from using `if-let` to just operating on `entry.size`
@zamu-flowerpot
Copy link
Author

zamu-flowerpot commented Nov 9, 2022

One thing I didn't look into was setting these behind CLI flags and this probably needs some memoization as it can be just slightly noticeable when looking at parents with a lot of children (or maybe just limiting it down to direct children).

@svenstaro
Copy link
Owner

I think this should probably be behind CLI flags as the performance hit can be severe and we don't want that by default. I will do a proper review a little later.

@svenstaro
Copy link
Owner

Cool stuff! As you already said, a memoization technique would be neat. I think a HashMap would be good enough for us. What do you also think about timing out on directories that just take too long? I don't miniserve to ever not feel snappy when generating pages.

Comment on lines +517 to +530
// @if let size = entry.size {
// span.mobile-info.size {
// (maud::display(size))
// }
// }
}
}
}
}
td.size-cell {
@if let Some(size) = entry.size {
(maud::display(size))
}
(maud::display(&entry.size))
// @if let size = entry.size {
// (maud::display(size))
// }
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's with the commented code here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't remove that since it was the code that would need to be switched on if we were going to throw everything behind a feature flag.

Comment on lines +333 to +335
// e2.size
// .unwrap_or_else(|| ByteSize::b(0))
// .cmp(&e1.size.unwrap_or_else(|| ByteSize::b(0)))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this.

src/listing.rs Outdated
Comment on lines 83 to 100
#[derive(Clone, Copy, PartialEq, Eq)]
/// Possible entry size types
pub enum EntrySize {
/// EntryCount is number of entries in a directory
EntryCount(usize),
/// Bytes is number of bytes in a file
Bytes(bytesize::ByteSize),
}

impl Display for EntrySize {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
EntrySize::EntryCount(count) => write!(f, "{}", count),
EntrySize::Bytes(bytes) => write!(f, "{}", bytes),
}
}
}

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be cool to also show the combined directory file size?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you are referring to by "combined directory file size". Are you referring to the total size in bytes of a directory? That might be a bit expensive and probably would require a persistent cache to be anywhere near fast.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the byte sizes. I recognize it might be too slow. Hm. If you like to, you could try to experiment with this.

@zamu-flowerpot
Copy link
Author

Cool stuff! As you already said, a memoization technique would be neat. I think a HashMap would be good enough for us. [...]

A HashMap would be fine. I tried my hand at using a once_cell::OnceCell<HashMap>, but couldn't figure out how to make a mutable global static. Maybe it'd just be as simple as just constructing it and passing it in as part of the function call to directory_listing?

[...] What do you also think about timing out on directories that just take too long? I don't miniserve to ever not feel snappy when generating pages.

Definitely. We might be able to shortcut a lot of the issues by just counting direct children by changing the size calculation to call .max_depth(1). It still might be slow, but it's on the user if they have a directory with a million files right?

src/listing.rs Outdated
@@ -256,12 +276,14 @@ pub fn directory_listing(
Err(_) => None,
};

let size = WalkDir::new(entry.path()).into_iter().count();
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it, we might be able to just do .take(UPPER_LIMIT).count() instead of .count() to shortcut execution time. So, we might have UPPER_LIMIT be something like 10001 and if it's more than 10000 we have the Impl Display for EntrySize print >10000 or something along those lines.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other option would have been to add an actual timeout, but I'm not really sure how to best do that in a non-async context.

@svenstaro
Copy link
Owner

Cool stuff! As you already said, a memoization technique would be neat. I think a HashMap would be good enough for us. [...]

A HashMap would be fine. I tried my hand at using a once_cell::OnceCell<HashMap>, but couldn't figure out how to make a mutable global static. Maybe it'd just be as simple as just constructing it and passing it in as part of the function call to directory_listing?

You're going to need a OnceCell<Arc<Mutex<HashMap>>>.

[...] What do you also think about timing out on directories that just take too long? I don't miniserve to ever not feel snappy when generating pages.

Definitely. We might be able to shortcut a lot of the issues by just counting direct children by changing the size calculation to call .max_depth(1). It still might be slow, but it's on the user if they have a directory with a million files right?

How about walking through the tree in threads using rayon? Should be fairly easy and it might also allow you to cancel/timeout threads.

`Cargo.lock` & `Cargo.toml`
===

* Remove `WalkDir`
* Add `fs_extra`
* Add `OnceCell`

`src/listing.rs`
===

* remove `use walkdir::WalkDir`
* add HashMap, Arc, Mutex, OnceCell, log::warn, and `FILE_SIZE_CACHE`
* Cargo fmt
`src/listing.rs`
===

* Cargo fmt
* Replace `WalkDir::new(entry.path()).into_iter().count()` with
  calculating directory size using `fs_extra::dir::get_size`
  * If `get_size` fails, fall back to `std::fs::read_dir` counting
@zamu-flowerpot
Copy link
Author

zamu-flowerpot commented Nov 21, 2022

You're going to need a OnceCell<Arc<Mutex<HashMap>>>.

Yep, worked like a charm. Thanks for the hint.

How about walking through the tree in threads using rayon? Should be fairly easy and it might also allow you to cancel/timeout threads.

Tried rayon and then encountered BurntSushi/walkdir#21. This lead me to look for something other than WalkDir and found fs_extra.

I just wanted to get this to a MVP and figure out what flags we want to incorporate (and how to wire them from the config/cli to the call site). So for now I left off the parallel scanning.

Looking at fs_extras code, it might be doable to parallelize the size scanning using Rayon. If we get all the decedent DirEntrys and push them onto a stack, for_each over the stack and cache their size, and then calculate the directory sizes from the cache on request, we should be pretty fast. The biggest issue still comes down to cache invalidation though it wouldn't be too hard to change the cache map to include modified time.

That said another option is to spawn a thread that listens for file system notifications and updates the cache on writes, but I'm only aware of Linux's inotify so I'm not sure if it's really cross-platform.

Let me know if you have any questions or concerns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants