-
-
Notifications
You must be signed in to change notification settings - Fork 142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add caching functionality (v2) #443
Conversation
Do I see it right, that the cache is stored as dedicated file |
pub(crate) type Cache = DashMap<Uri, CacheStatus>; | ||
|
||
pub(crate) trait StoreExt { | ||
fn store<T: AsRef<Path>>(&self, path: T) -> Result<()>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is internal API, we don't need to make it generic (unless it's necessary).
I guess we can use &Path instead.
The potential benefit is faster compilation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. Although I "benchmarked" it with a few compiles and I couldn't find a difference. Probably because the rest takes longer. 😆
Agreed. |
Storing cache result in CSV seems a straightforward choice, since we've mature and performant serializer/deserializer and it's human readable. But if we choose some other format such as apache arrow, it can be faster and the non-readability may even be desired by some users. |
After a quick glance I don't understand how cache invalidation works. You're invalidating on a per-file level, but the cached requests can be really old, no? I agree that the caching is a surprising default. If the caching is automatically set up to work in the gh action, it could be nice to have enabled by default though in CI, filesize matters, as the cache is downloaded/uploaded. Perhaps you can store bloom filters instead. I would never cache filesystem access (not sure if you're doing that) otherwise lgtm |
Yes indeed. We could add an individual timestamp for each cached request. It would be quite repetitive I guess. Any alternative ideas?
Good idea. We should enable it and add an example on how to use caching with the cache action.
Yeah that's true and I agree with you and @lebensterben here. We should make the cache generic. In the future I would also like to support Redis for cluster setups. 👍 We'll probably start with CSV and maybe bloom filters / xor filters and later add more variants.
Yeah I do, but I shouldn't. FS access is pretty fast and volatile so that was an oversight on my end. Nice catch! Thanks for all the great feedback @MichaIng, @lebensterben, and @untitaker. ❤️ Conclusions so far:
I'll add them as TODOs to the PR description. |
if stored as bloom filter, you need to store a bloom filter per
I think the entire performance tradeoff will hinge on CI, as having a cache in CI is really costly. I don't think it's necessary to support storing the cache in multiple formats or data stores. what do you get from redis support? do you want to do partial cache updates? then use a cache dir. why support both bloom and xor? is there a tradeoff I'm missing that prevents you from picking the universally best format+datastore? |
The idea was that if you run a cluster of lychees you could have a shared cache. This can be helpful when lychee is running behind an API endpoint, which I'm planning to do.
Yeah, partial updates. You might be right actually. We should try to get as far as possible with simple file-based caching.
No I meant either bloom or xor, not both. 😉 On top of that I would still like to keep CSV because removing cache entries can be done with a simple editor and it's great for debugging (and compresses well). Does that make sense? |
The only issue with CSV is the redundant timestamp storage. [[success]]
time = 1979-05-27T07:32:00-08:00
requests = [ ... ]
[[success]]
time = 1980-05-27T07:32:00-08:00
requests = [ ... ] But then it's not easily " |
that's fair. it's still possible to do it with filesystem (see how clustering with afl works), but redis seems much easier. |
@untitaker, your comment got me thinking:
For CSV, we could store the cache files like so
The upside would be that we only have to store the timestamp once — in the filename. I think that was also @MichaIng's idea IIRC. The downside would be that we have to manage multiple cache files and think about cleaning them up to not waste disk space. Also we'd have to merge the files on startup, so maybe it's more trouble than it's worth. |
Note that due to block sizes, usually multiple files are magnitude of orders larger than redundant timestamps. Not sure if this is a big issue to have some KiB more? |
I'm okay with adding the timestamps to every entry. If we use Unix timestamps, they are quite compact anyway. If we use full RFC3339 dates, then they are a little longer, but also human-readable. Then again I don't think many people will edit the file manually, let alone based on the request time, so it shouldn't matter. |
how about a single file that contains tuples/csv rows:
a cache row is expired when since this puts a theoretical upper bound on the number of rows: there can only be this is not for the sake of microoptimization, but to get rid of the file-size question. because things are now microoptimized, you can now choose |
Since you mentioned timestamp and cluster, it's worth mentioning that time format such as
|
Guess a UNIX timestamp would be the easiest as it's guaranteed to be UTC. |
Question: if |
I think the existence of --cache (an option to pass cache path) pretty much implies there's no magic filenames that lychee picks up automatically
…On Wed, Jan 12, 2022, at 23:28, Matthias wrote:
Question: if `--cache` is *not* set but the `.lycheecache` file exists, should we still load it? I guess not, but I wanted to hear people's thoughts.
—
Reply to this email directly, view it on GitHub <#443 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAGMPRIKEPJGA2ZTWPUY67TUVX6A3ANCNFSM5LIQVLSQ>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
|
Added the timestamp functionality we talked about. @untitaker, your approach is superior and quite clever, but I decided to keep it simple for now and just add a timestamp to every entry. It's wasteful, but quite easy to explain and most people wouldn't use a CSV file anyway if they had space constraints. As for the bloomfilter / xor filter, I quite frankly didn't get far.
however the trait object doesn't have a size. Other than that I'd say we could merge this from my side. It's an optional feature anyway, so let's not sweat it. If anyone has the time to give it a spin, I'd be happy for any final comments. |
looks good enough to merge. no need to finish all features in this PR. |
why this gets recommended now is beyond me
A while ago, caching was removed due to some issues (see #349).
This is a new implementation with the following improvements:
collect_links
instances, which probably was an issue before.Mutex<HashMap>
in my tests.CacheStatus
for serialization, because trying to serialize the error kinds inStatus
turned out to be a bit of a nightmare and at this point I don't think it's worth the pain (and probably isn't idiomatic either).TODO
Open points that evolved during our discussion:
--cache
flag to enable.Support different cache formats. E.g. a compact format using bloomfilters/xor filters.Can't get it to work, see Add caching functionality (v2) #443 (comment).Follow-up after merge
Future improvements
#[cache]
attribute.--format=detailed
is activated.Fixes #348
@MichaIng @lebensterben @pawroman @untitaker fyi