-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rangeToArray tries to read every possible cell leading to OOM #3982
Comments
Not a bug, and nothing to do with the Reader. The Reader reads the spreadsheet file, and the values returned for a worksheet by |
@tommasoiovane PR #3839 (part of release 2.0.0) and PR #3906 (part of release 2.1.0) address performance problems in toArray. I think they significantly help your issue. The code below is identical to the code you supplied, except for the addition of the echo statements. echo "starting read ", (new DateTime())->format('Y-m-d H:i:s'), "\n";
$spreadsheet = \PhpOffice\PhpSpreadsheet\IOFactory::load("issue.3982.xlsx");
echo "starting toArray ", (new DateTime())->format('Y-m-d H:i:s'), "\n";
$data = $spreadsheet->getActiveSheet()->toArray(null, true, false, true);
echo "ending toArray ", (new DateTime())->format('Y-m-d H:i:s'), "\n";
echo "peak memory is ", memory_get_peak_usage(),"\n"; Output:
This should be nowhere close to OOM on most systems - I believe PHP ships with a configurable memory limit of 128M, more than 3 times the memory usage of this script. My results are similarly good with 2.0.0 and 2.1.0, and I agree they are much worse on 1.29.0 for both memory and time. Please try again with the newest release. |
I don't know how they got there, but the following lines in the xml are the source of the problem: <row r="1048570" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/>
<row r="1048571" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/>
<row r="1048572" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/>
<row r="1048573" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/>
<row r="1048574" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/>
<row r="1048575" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/>
<row r="1048576" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/> In other words, as far as this spreadsheet is concerned, rows 1048570-1048576 exist. The processing of the row attributes is entirely separate from the processing of the cells, which is why IGNORE_EMPTY_CELLS had no effect. Now, since there are no cells associated with those rows, it might be possible to have Xlsx Reader optionally ignore these rows. It's not something we can do automatically (that would be a breaking change), and we probably don't want to use IGNORE_EMPTY_CELLS (ditto), but it might be possible to add a new option to the read filter or to Xlsx reader. I will think about it. |
Fix PHPOffice#3982. A number of issues submitted about Xlsx read performance have a common theme, namely that row 1,048,576 and a few rows before it are defined in the worksheet Xml with no cells attached to them. These might be the work of a third party product. While these extraneous rows do not cause any problems for the cells that are actually used on the worksheet, they can lead to excessive memory use. This PR provides an option for the application to ignore rows with no cells when loading. Recent changes to the load logic had already made a significant difference to memory consumption and load time. For the spreadsheet attached to issue 3982, which had caused out-of-memory errors on the user's system, peak memory usage was already reduced to 40-odd MB. With the new option, this is drastically reduced again, to just over 9MB. Specifying the new option is very easy: ```php $reader->setIgnoreRowsWithNoCells(true); ``` Note that there are cases where you might not want this (non-default) behavior. For example, if you set a row height on a row with no cells, the height would be lost with this option. Unfortunately, the extraneous row definitions in the problematic spreadsheets claim to have a custom height, so I can't just use "no custom row styles" as an additional filter.
This is:
What is the expected behavior?
I would expect
readActiveSheet
to return only cells with value or something in it.What is the current behavior?
With certain xlsx files rangeToArray calculate $maxCol as "AMK" and $maxRow as "1048579" leading to very slow performance and OOM memory issues.
What are the steps to reproduce?
This code will fail running out of memory. I managed to read the xls file changing Worksheet.php rangeToArray like this:
Is there any reason why it should'nt be like this?
What features do you think are causing the issue
Which versions of PhpSpreadsheet and PHP are affected?
I tried both on 1.3 and 2.0.0
Thank you in advance
test_file.xlsx
The text was updated successfully, but these errors were encountered: