Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache prefix should/could be optional #823

Closed
matt-allan opened this issue Dec 18, 2018 · 5 comments
Closed

Cache prefix should/could be optional #823

matt-allan opened this issue Dec 18, 2018 · 5 comments
Labels
BREAKING CHANGE pinned pinned issue to avoid them becoming stale

Comments

@matt-allan
Copy link

matt-allan commented Dec 18, 2018

This is:

- [ ] a bug report
- [x] a feature request
- [ ] **not** a usage question (ask them on https://stackoverflow.com/questions/tagged/phpspreadsheet or https://gitter.im/PHPOffice/PhpSpreadsheet)

What is the expected behavior?

The cell cache does not require a prefix.

What is the current behavior?

All coordinates are prefixed automatically by PhpOffice\PhpSpreadsheet\Collection\Cells

What are the steps to reproduce?

Please provide a Minimal, Complete, and Verifiable example of code that exhibits the issue without relying on an external Excel file or a web server:

<?php

require __DIR__ . '/vendor/autoload.php';

class Memory extends \PhpOffice\PhpSpreadsheet\Collection\Memory
{
    public function set($key, $value, $ttl = null)
    {
        fwrite(STDOUT, "Cache key: {$key}\n"); // prints something like "Cache key: phpspreadsheet.5c19219c3c4962.46848305.A1"
        return parent::set($key, $value, $ttl);
    }
}

\PhpOffice\PhpSpreadsheet\Settings::setCache(new Memory());

$spreadsheet = new \PhpOffice\PhpSpreadsheet\Spreadsheet();
$spreadsheet->getActiveSheet()->setCellValue('A1', 'foo')->setCellValue('A2', 'bar');

Which versions of PhpSpreadsheet and PHP are affected?

Any version >= 1.0.0-beta

Description

The class PhpOffice\PhpSpreadsheet\Collection\Cells automatically adds a cache prefix to every key before adding it to the cache. The prefix looks like phpspreadsheet.5c19219c3c4962.46848305..

The problem with prefixing is it ends up requiring a lot more memory if the cache adapter stores the keys in memory. The immediately obvious issue is that we have to store the extra 39 characters for the prefix, repeated by the number of cells in the spreadsheet. However the memory usage is even worse because Cells already stores an index of every coordinate. If the cache keys were not prefixed PHP could reference the same value for both Cells::index and the Cache implementation's index. Since the prefix is added PHP stores each coordinate twice, once without the prefix and once with the prefix.

I think this could be avoided by letting the cache determine if it wants to add a prefix or not. Then the cache adapter could either namespace the keys itself or the user could use a decorator like this one. It seems like the intention of the prefix is to avoid collisions and the cache should know best if it needs to worry about that.

When using the in memory cache you could avoid collisions by creating a new instance of the memory cache per worksheet. This would require changing Settings::getCache to not force a singleton since it currently returns a single cache instance for all worksheets. We would probably need to let the user specify a factory instead?

I tried working around this by making the cache adapter index the cache keys in a multidimensional array as [prefix][coordinate]. It helped memory usage but hurt performance because of all of the string operations necessary.

I benchmarked performance without prefixing and saw a 21% improvement in runtime and 22% improvement in memory usage. However there is only a 8.63% improvement in runtime and 14% improvement in memory usage if the changes in #822 are made first.

@MarkBaker
Copy link
Member

Did you try working with more than one Spreadsheet object concurrently?

@matt-allan
Copy link
Author

Not in my benchmark script. It wouldn't have worked correctly because Settings is currently using a single cache entry for every spreadsheet. I outlined some of my thoughts for working around that in the description above.

@matt-allan
Copy link
Author

As I mentioned in #629 removing the PSR cache support entirely cuts the memory usage in half[1] (when profiling with Blackfire, it only saves 30% according to memory_get_usage). I'm guessing this is a combination of the prefixed strings, the cost of having two arrays of coordinates, and the memory allocated when getAllCacheKeys is called. This benchmark was loading a lot of cells with small values; I'm sure the improvement wouldn't be as drastic if you were loading fewer cells with larger values.

[1] https://blackfire.io/profiles/compare/944b11fd-62fb-40d2-8f8f-e45c46d169c6/graph

@PowerKiKi
Copy link
Member

the prefix is to avoid collisions and the cache should know best if it needs to worry about that

I agree with that, and as performance is a constant struggle with PhpSpreadsheet, we should strive to optimize the common use cases. But we must guarantee that default setup still allow concurrent usage (within same process or different processes). We would also need to document very clearly for advanced users using custom caches that avoiding collision is up to them.

The factory idea sounds good to me. Typically that would allow for a custom factory to generate a unique prefix for each instances if needed. Or implement other way to separate caches (eg: use a different directory for a cache writing to disk).

It makes me wonder if we should integrate with PSR container. I've been thinking about containers before. At least the IOFactory could benefit from that. Maybe other things too. But if we go that way, we definitely should not raise the bar for newcomers and keep everything straightforward for default use-case.

Whichever route we go, I feel that it would almost always introduce (relatively minor) breaking change. And that makes this work a good candidate for v2.

@stale
Copy link

stale bot commented Mar 3, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
If this is still an issue for you, please try to help by debugging it further and sharing your results.
Thank you for your contributions.

@stale stale bot added the stale label Mar 3, 2019
@stale stale bot closed this as completed Mar 10, 2019
@PowerKiKi PowerKiKi added pinned pinned issue to avoid them becoming stale and removed stale labels May 26, 2019
@PowerKiKi PowerKiKi reopened this May 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BREAKING CHANGE pinned pinned issue to avoid them becoming stale
Development

No branches or pull requests

3 participants