Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allowed memory size of X bytes exhausted in PhpSpreadsheet\Worksheheet\Worksheet.php on line 3108 #3814

Closed
1 of 8 tasks
petruchek opened this issue Dec 7, 2023 · 4 comments · Fixed by #3834
Closed
1 of 8 tasks

Comments

@petruchek
Copy link

petruchek commented Dec 7, 2023

This is:

- [X] a bug report
- [ ] a feature request
- [ ] **not** a usage question (ask them on https://stackoverflow.com/questions/tagged/phpspreadsheet or https://gitter.im/PHPOffice/PhpSpreadsheet)

What is the expected behavior?

The spreadsheet array containing the textual data from the spreadsheet is initialized.

What is the current behavior?

Fatal error: Allowed memory size of 536870912 bytes exhausted (tried to allocate 36864 bytes) in PhpSpreadsheet\Worksheheet\Worksheet.php on line 3108

What are the steps to reproduce?

<?php

require_once './vendor/autoload.php';

$spreadsheet = \PhpOffice\PhpSpreadsheet\IOFactory::load('./example.xlsx');

$names = $spreadsheet->getSheetNames();
$result = [];
for($i=0;$i<$spreadsheet->getSheetCount();$i++)
{
    $result[$names[$i]] = $spreadsheet->getSheet($i)->toArray();
}

print_r($result);

If this is an issue with reading a specific spreadsheet file, then it may be appropriate to provide a sample file that demonstrates the problem; but please keep it as small as possible, and sanitize any confidential information before uploading. File attached below.

What features do you think are causing the issue

  • Reader
  • Writer
  • Styles
  • Data Validations
  • Formula Calculations
  • Charts
  • AutoFilter
  • Form Elements

Does an issue affect all spreadsheet file formats? If not, which formats are affected?

This happens for a specific file, I am attaching it here: example.xlsx

Which versions of PhpSpreadsheet and PHP are affected?

I have tested with 1.29 on PHP 7.4 (BTW, I thought 7.4 is no longer supported?) and PHP 8.2 - happens on both.

I've looked into rangeToArray function a little bit, this file has range of A1:AMJ1048576, and when we are looping
for ($col = $minCol; $col !== $maxCol; ++$col) { we never reach $maxCol value of AMK

I understand the file is crazy, yet I believe Reader should not go into infinite loop even for a crazy file like this.

@gagsy
Copy link

gagsy commented Dec 19, 2023

Hi petruchek,

I have tested the example.xlsx(only data in 1 row) on Php 8.2.
Steps

  • installed latest version of "phpoffice/phpspreadsheet": "^1.29"
  • Get data using
  • $spreadsheet = \PhpOffice\PhpSpreadsheet\IOFactory::load('example.xlsx');
  • $names = $spreadsheet->getSheetNames();
  • $sheet = $spreadsheet->getSheet(0);
  • $rowData = $sheet->rangeToArray('A1:Z1', null, true, false)[0];
  • print_r($rowData);

Let me know if this works for you or You can assign me this issue

Thanks

@sj-i
Copy link
Contributor

sj-i commented Dec 19, 2023

Hi!

Actually, it's not an infinite loop. Some kind of optimization and / or fix should be made though.

If you use PHP 8.2 or later, and set the huge memory_limit like 24G, toArray() itself can be completed.
But the part holding the result and printing out them also consume huge amount of memories, so even 50G is not enough for the test script in the original issue.

I'm here from the GitHub search to test the new version of my memory profiler (Reli 0.11.0). "Allowed memory size of" is my personal trending word these days. I've done some investigation about this issue, so I share it.

Test script

composer.json

{
    "require": {
        "phpoffice/phpspreadsheet": "^1.29"
    }
}

analyzer.php

<?php
register_shutdown_function(
    function (): void {
        $error = error_get_last();
        if (is_null($error)) {
            return;
        }
        if (strpos($error['message'], 'Allowed memory size of') !== 0) {
            return;
        }
        $pid = getmypid();
        $file_opt = '--memory-limit-error-file=' . escapeshellarg($error['file']);
        $line_opt = '--memory-limit-error-line=' . escapeshellarg($error['line']);
        system("sudo /home/sji/work/reli-prof/reli i:m -p {$pid} --no-stop-process {$file_opt} {$line_opt} >memory_analyzed.json");
    }
);

test.php

<?php

require_once './vendor/autoload.php';

$spreadsheet = \PhpOffice\PhpSpreadsheet\IOFactory::load('./example.xlsx');

$names = $spreadsheet->getSheetNames();
$result = [];
for($i=0;$i<$spreadsheet->getSheetCount();$i++)
{
    foreach ($spreadsheet->getSheet($i)->toArray() as $row) {
        print_r($row);
    }
}

Run with 8M limit

$ php -dmemory_limit=8M -dauto_prepend_file=analyzer.php test.php 
PHP Fatal error:  Allowed memory size of 8388608 bytes exhausted (tried to allocate 20480 bytes) in /home/sji/work/oss/tmp/phpspreadsheet_test/vendor/phpoffice/phpspreadsheet/src/PhpSpreadsheet/Worksheet/Worksheet.php on line 3107

$ ls -la memory_analyzed.json 
-rw-rw-r-- 1 sji sji 7318949 Dec 19 04:38 memory_analyzed.json

Summary

$ cat memory_analyzed.json | jq .summary
[
  {
    "zend_mm_heap_total": 8388608,
    "zend_mm_heap_usage": 7266520,
    "zend_mm_chunk_total": 8388608,
    "zend_mm_chunk_usage": 7266520,
    "zend_mm_huge_total": 0,
    "zend_mm_huge_usage": 0,
    "vm_stack_total": 262144,
    "vm_stack_usage": 1088,
    "compiler_arena_total": 2011608,
    "compiler_arena_usage": 52872,
    "possible_allocation_overhead_total": 953990,
    "possible_array_overhead_total": 254560,
    "memory_get_usage": 7298856,
    "memory_get_real_usage": 8388608,
    "cached_chunks_size": 0,
    "heap_memory_analyzed_percentage": 99.55697166788877,
    "php_version": "v82",
    "analyzer": "reli 0.11.0"
  }
]

Reli successfully analyzes the 99.55% of the heap usage.

Extracting top 20 most memory-consuming types of memory locations

$ cat memory_analyzed.json | jq .location_types_summary | jq -r '(["location_type", "count", "memory_usage"] | (., map(length*"="))),(to_entries|.[:20]|.[]|[.key,.value.count,.value.memory_usage])|@tsv' | column -t -o ' | '
location_type                        | count | memory_usage
=============                        | ===== | ============
ZendArrayTableMemoryLocation         | 776   | 3737712
ZendArrayTableOverheadMemoryLocation | 775   | 113200
ZendObjectMemoryLocation             | 697   | 109640
RuntimeCacheMemoryLocation           | 391   | 52880
ZendArrayMemoryLocation              | 778   | 43568
ZendStringMemoryLocation             | 418   | 25298
ObjectsStoreMemoryLocation           | 1     | 8192
StaticMembersTableMemoryLocation     | 16    | 928
CallFrameVariableTableMemoryLocation | 5     | 688
CallFrameHeaderMemoryLocation        | 5     | 400
ZendReferenceMemoryLocation          | 5     | 160
ZendResourceMemoryLocation           | 3     | 72

Array data consumes the most memory.

Extracting the references of top 20 largest arrays

$ cat memory_analyzed.json | jq '. as $root | path(..|objects|select(."#type"=="ArrayHeaderContext"))| . as $path | $root|getpath($path) as $header | $header.array_elements as $elements | {path: $path|join("."), size: $elements."#locations"[0].size, count: $elements."#count", node_id:$header."#node_id"}' | jq -rs '(["size", "count", "node_id" ,"path"] | (., map(length*"="))),(sort_by(.size) | .[-20:] | reverse | .[] | [.size, .count, .node_id, .path])|@tsv' | column -t -o ' | '
size  | count | node_id | path
====  | ===== | ======= | ====
24960 | 524   | 475555  | context.class_table.phpoffice\\phpspreadsheet\\calculation\\calculation.static_properties.phpSpreadsheetFunctions
16392 | 1024  | 452757  | context.call_frames.2.local_variables.returnValue.array_elements.218.value
16392 | 1024  | 450705  | context.call_frames.2.local_variables.returnValue.array_elements.217.value
16392 | 1024  | 448653  | context.call_frames.2.local_variables.returnValue.array_elements.216.value
16392 | 1024  | 446601  | context.call_frames.2.local_variables.returnValue.array_elements.215.value
16392 | 1024  | 444549  | context.call_frames.2.local_variables.returnValue.array_elements.214.value
16392 | 1024  | 442497  | context.call_frames.2.local_variables.returnValue.array_elements.213.value
16392 | 1024  | 440445  | context.call_frames.2.local_variables.returnValue.array_elements.212.value
16392 | 1024  | 438393  | context.call_frames.2.local_variables.returnValue.array_elements.211.value
16392 | 1024  | 436341  | context.call_frames.2.local_variables.returnValue.array_elements.210.value
16392 | 1024  | 434289  | context.call_frames.2.local_variables.returnValue.array_elements.209.value
16392 | 1024  | 432237  | context.call_frames.2.local_variables.returnValue.array_elements.208.value
16392 | 1024  | 430185  | context.call_frames.2.local_variables.returnValue.array_elements.207.value
16392 | 1024  | 428133  | context.call_frames.2.local_variables.returnValue.array_elements.206.value
16392 | 1024  | 426081  | context.call_frames.2.local_variables.returnValue.array_elements.205.value
16392 | 1024  | 424029  | context.call_frames.2.local_variables.returnValue.array_elements.204.value
16392 | 1024  | 421977  | context.call_frames.2.local_variables.returnValue.array_elements.203.value
16392 | 1024  | 419925  | context.call_frames.2.local_variables.returnValue.array_elements.202.value
16392 | 1024  | 417873  | context.call_frames.2.local_variables.returnValue.array_elements.201.value
16392 | 1024  | 415821  | context.call_frames.2.local_variables.returnValue.array_elements.200.value

There seems to be a large number of 1024-element arrays as elements of the array $returnValue in local variables in the 3rd call frame.

Extracting the call trace

$ cat memory_analyzed.json | jq -r '(["frame_no", "function", "line"] | (., map(length*"="))),(path(.context.call_frames[]|objects) as $path | [$path[2], getpath($path).function_name, getpath($path).lineno])|@tsv' | column -t
frame_no  function                                                                 line
========  ========                                                                 ====
0         system                                                                   -1
1         {closure}(/home/sji/work/oss/tmp/phpspreadsheet_test/analyzer.php:3-15)  14
2         PhpOffice\\PhpSpreadsheet\\Worksheet\\Worksheet::rangeToArray            3107
3         PhpOffice\\PhpSpreadsheet\\Worksheet\\Worksheet::toArray                 3214
4         <main>                                                                   12

The executing function of the 3rd call frame is PhpOffice\PhpSpreadsheet\Worksheet\Worksheet::rangeToArray().

Extracting some local variables

$ cat memory_analyzed.json | jq '.context.call_frames."2".local_variables|{row, minRow, maxRow, col, minCol, maxCol, range, rangeStart, rangeEnd}'
{
  "row": {
    "#node_id": 455852,
    "#type": "ScalarValueContext",
    "value": 220
  },
  "minRow": {
    "#reference_node_id": 455842
  },
  "maxRow": {
    "#reference_node_id": 455849
  },
  "col": {
    "#node_id": 455855,
    "#type": "StringContext",
    "#locations": [
      {
        "address": 140454052395712,
        "size": 26,
        "refcount": 1,
        "type_info": 22,
        "value": "SS"
      }
    ]
  },
  "minCol": {
    "#reference_node_id": 5212
  },
  "maxCol": {
    "#node_id": 455850,
    "#type": "StringContext",
    "#locations": [
      {
        "address": 140454065514368,
        "size": 27,
        "refcount": 1,
        "type_info": 22,
        "value": "AMK"
      }
    ]
  },
  "range": {
    "#node_id": 5417,
    "#type": "StringContext",
    "#locations": [
      {
        "address": 140454065211784,
        "size": 37,
        "refcount": 1,
        "type_info": 22,
        "value": "A1:AMJ1048576"
      }
    ]
  },
  "rangeStart": {
    "#node_id": 455836,
    "#type": "ArrayHeaderContext",
    "#locations": [
      {
        "address": 140454052645000,
        "size": 56,
        "refcount": 1,
        "type_info": 3221336071
      }
    ],
    "possible_unused_area": {
      "#node_id": 455837,
      "#type": "ArrayPossibleOverheadContext",
      "#locations": [
        {
          "address": 140454052589032,
          "size": 96,
          "used_location": {
            "address": 140454052588992,
            "size": 40,
            "is_packed": true
          }
        }
      ]
    },
    "array_elements": {
      "#node_id": 455838,
      "#type": "ArrayElementsContext",
      "#locations": [
        {
          "address": 140454052588992,
          "size": 40,
          "is_packed": true
        }
      ],
      "#count": 2,
      "0": {
        "#node_id": 455839,
        "#type": "ArrayElementContext",
        "value": {
          "#node_id": 455840,
          "#type": "ScalarValueContext",
          "value": 1
        }
      },
      "1": {
        "#node_id": 455841,
        "#type": "ArrayElementContext",
        "value": {
          "#node_id": 455842,
          "#type": "StringContext",
          "#locations": [
            {
              "address": 94044376008848,
              "size": 25,
              "refcount": 2,
              "type_info": 342,
              "value": "1"
            }
          ]
        }
      }
    }
  },
  "rangeEnd": {
    "#node_id": 455843,
    "#type": "ArrayHeaderContext",
    "#locations": [
      {
        "address": 140454065743336,
        "size": 56,
        "refcount": 1,
        "type_info": 3221335047
      }
    ],
    "possible_unused_area": {
      "#node_id": 455844,
      "#type": "ArrayPossibleOverheadContext",
      "#locations": [
        {
          "address": 140454052531208,
          "size": 96,
          "used_location": {
            "address": 140454052531168,
            "size": 40,
            "is_packed": true
          }
        }
      ]
    },
    "array_elements": {
      "#node_id": 455845,
      "#type": "ArrayElementsContext",
      "#locations": [
        {
          "address": 140454052531168,
          "size": 40,
          "is_packed": true
        }
      ],
      "#count": 2,
      "0": {
        "#node_id": 455846,
        "#type": "ArrayElementContext",
        "value": {
          "#node_id": 455847,
          "#type": "ScalarValueContext",
          "value": 1024
        }
      },
      "1": {
        "#node_id": 455848,
        "#type": "ArrayElementContext",
        "value": {
          "#node_id": 455849,
          "#type": "StringContext",
          "#locations": [
            {
              "address": 140454065658240,
              "size": 31,
              "refcount": 2,
              "type_info": 22,
              "value": "1048576"
            }
          ]
        }
      }
    }
  }
}

This can be read as like the following:

$minRow === '1';
$row === 220;
$maxRow === '1048576';
$minCol === 'A';
$col === 'SS';
$maxCol === 'AMK';
$range === 'A1:AMJ1048576';
$rangeStart === [1, '1']:
$rangeEnd === [1024, '1048576']:

So, it tries to create 1,048,576 arrays each having 1,024 elements.

The size of zval in PHP is 16 bytes each. Each element in packed arrays has an overhead of extra 16 bytes in PHP 8.1 and earlier, while in PHP 8.2+ they only require the same size as the zval.

So at least 16 GB is required for PHP 8.2 and later, and 32 GB for PHP 8.1 and earlier, just to hold the contents of this array $returnValue.

A possible optimization

Currently, rangeToArray() first initialises the array element for each row with $nullValue and only replaces it if the cell corresponding to that position actually has a value.

PHP arrays are CoW, so it's easy to dramatically reduce memory consumption for cases like this.

First prepare an array filled with $nullValue for the number of columns we need. By initialising the result array for each row of the sheet with that array first, and only rewriting the rows where the cell actually has a value, we can deduplicate the memory space for all rows where its cell doesn't have a value.

Then the test.php in this comment can run wih 32M of the memory_limit now.

I'll send a PR later.

Is this really a problem to be solved in that way?

May or may not be. I don't know the requirement of this library well.

This xlsx file has attribute settings in the AME to AMJ columns and also in the row number 1048576. This is what causes $maxRow and $maxCol to be expanded so far.

The actual range of data in the cells is much smaller, so it is questionable whether toArray() is right to try to handle such a wide range of data for this file in the first place.

Nevertheless, there may be cases where some data is actually located "far away", so it would be not so bad to include such an optimization anyway.

@sj-i
Copy link
Contributor

sj-i commented Dec 19, 2023

Oh, did another person see this issue? Anyway, I'll send a PR to reduce memory usage in cases where there are many empty rows.

@petruchek
Copy link
Author

Then the test.php in this comment can run wih 32M of the memory_limit now.

I'll send a PR later.

Is this really a problem to be solved in that way?

May or may not be. I don't know the requirement of this library well.

That's quite an analysis. Impressive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
3 participants