Reading large File increases Execution time & Memory (E.g file with 500000 records) #629
I would like to add a bit of clarity to this as well, since we are working together on this project. We are processing rather large files, upwards of 300k lines. In order to scale this and not load the whole file into memory at once, we are using the chunk filter functionality as laid out here: https://phpspreadsheet.readthedocs.io/en/develop/topics/reading-files/#loading-a-spreadsheet-file under "Reading Only Specific Columns and Rows from a File (Read Filters)". We did some baseline testing of our script and we're noticing that once we pass the threshold of about 250k records, the execution time and memory usage increase dramatically. What we're basically looking to determine here is how we can proceed further. Is there a known limitation with reading a certain sized file? Are there any steps we can take to make the reader perform better? Below is the code we're using for our Chunk IReadFilter:
<?php
use PhpOffice\PhpSpreadsheet\Reader\IReadFilter;
/** Define a Read Filter class implementing IReadFilter */
class Chunk implements IReadFilter
{
private $startRow = 0;
private $endRow = 0;
/**
* Set the list of rows that we want to read.
*
* @param mixed $startRow
* @param mixed $chunkSize
*/
public function setRows($startRow, $chunkSize)
{
$this->startRow = $startRow;
$this->endRow = $startRow + $chunkSize;
}
public function readCell($column, $row, $worksheetName = '')
{
// Only read the heading row, and the rows that are configured in $this->startRow and $this->endRow
if (($row == 1) || ($row >= $this->startRow && $row < $this->endRow)) {
return true;
}
return false;
}
}
?>
As well as a trimmed down version of the code we're using to chunk the file:
// Create a new Reader of the type defined in $inputFileType
$reader = IOFactory::createReader($inputFileType);
// Define how many rows we want to read for each "chunk"
$chunkSize = 10000;
// Create a new Instance of our Read Filter
$chunkFilter = new Chunk();
// Tell the Reader that we want to use the Read Filter that we've Instantiated
$reader->setReadFilter($chunkFilter);
// Loop to read our worksheet in "chunk size" blocks
for ($startRow = 1; $startRow <= $rawRows; $startRow += $chunkSize) {
// Tell the Read Filter, the limits on which rows we want to read this iteration
$chunkFilter->setRows($startRow, $chunkSize);
// Load only the rows that match our filter from $inputFileName to a PhpSpreadsheet Object
$spreadsheet = $reader->load($inputFileName);
...
}
Thank you in advance for your assistance. Brian F (& Hemanth K, the original poster) |
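For completeness, a minimal sketch of what the body of that loop might look like; the end-row arithmetic and the getHighestColumn() call are illustrative additions on my part, not part of the original script:
// The filter only populated the heading row plus rows $startRow .. $startRow + $chunkSize - 1.
$worksheet = $spreadsheet->getActiveSheet();
$endRow = min($startRow + $chunkSize - 1, $rawRows);
// Pull just this chunk into a plain PHP array, without evaluating formulas or applying formatting.
$chunkRows = $worksheet->rangeToArray(
    'A' . $startRow . ':' . $worksheet->getHighestColumn() . $endRow,
    null,   // value returned for empty cells
    false,  // $calculateFormulas
    false   // $formatData
);
foreach ($chunkRows as $record) {
    // ... process one record ...
}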
I'm currently also looking into why it takes so much memory to read a file. What I suggest is to install xdebug 2.6, which has memory usage profiling. I've already found some problematic areas in my code which reduced the memory usage somewhat, but it's still way too high. |
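For anyone wanting to reproduce that kind of profile: with xdebug 2.x the profiler has to be enabled at startup, for example (the output directory is just an example path):
php -d xdebug.profiler_enable=1 -d xdebug.profiler_output_dir=/tmp myscript.php
With xdebug 3.x the equivalent settings are xdebug.mode=profile and xdebug.output_dir. The resulting cachegrind file can then be inspected in QCacheGrind/KCachegrind or PhpStorm.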
Looking for some more guidance on this from the Devs. Maybe they know something directly related that can help us to resolve or at least improve our situation. |
Tried using file system caching of cells... a 411K row file still consumes beyond 1GB of RAM... chunks or no chunks, the behavior is roughly the same. What are we missing? |
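For reference, the cell caching mentioned above is plugged in through a PSR-16 cache before the file is loaded; the $cache object below stands for any \Psr\SimpleCache\CacheInterface implementation (for example a filesystem-backed one), which is an assumption about the setup rather than code from this thread:
use PhpOffice\PhpSpreadsheet\Settings;

// $cache must be a \Psr\SimpleCache\CacheInterface instance; it has to be
// registered before the reader builds the cell collection.
Settings::setCache($cache);
$spreadsheet = $reader->load($inputFileName);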
0.0 |
We have been playing with this library for a few weeks now, benchmarking and profiling various read and write methods. We have hit several bottlenecks and managed to escape most of them except one: reading large XLSX files! Before we discuss the XLSX problem in particular, I would like to share a few more things:
Obviously, both methods reduce memory usage but also increase processing time considerably. Here is our sample file: https://drive.google.com/file/d/1DbKD28u46BI761YjdsVelvjFux6YZJl1/view?usp=sharing Whenever load() is called on this file, 8GB of RAM are exhausted immediately regardless of chunk size... and it seems to want even more after exhausting the full RAM available, as it started using swap.
For now we ended up making use of listWorksheetInfo to detect the total number of rows and columns and abandon the whole process if that number is too big. The listWorksheetInfo method seems to be able to handle the file quite well, but once you try to load the file everything blows up. In total we have: 161 columns * 100K rows = 16.1 million cells. After converting to XLS (which cannot go beyond 65536 rows) we are also able to process the 65K rows without exceeding 500 MB of RAM. Any help on this matter would be highly appreciated. |
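A minimal sketch of such a pre-flight check, assuming an Xlsx input and an arbitrary example threshold of one million cells:
use PhpOffice\PhpSpreadsheet\IOFactory;

$reader = IOFactory::createReader('Xlsx');
// listWorksheetInfo() reports worksheet dimensions without building the full cell collection.
$info = $reader->listWorksheetInfo($inputFileName);
$totalCells = $info[0]['totalRows'] * $info[0]['totalColumns'];
if ($totalCells > 1000000) {
    throw new RuntimeException('Spreadsheet too large to import: ' . $totalCells . ' cells');
}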
Thanks @bferrantino and @pop-mihai for those detailed reports. I unfortunately cannot give you a quick and easy solution, because I don't know PhpSpreadsheet's ugly details well enough. But we can attempt a few things. First off, @pop-mihai, you mentioned that CSV is successful with 16 million cells. The first thing that comes to mind is that CSV does not have any styling at all. Have you tried commenting out all style-related code in the XLSX reader? Does it make any difference? Were you able to do memory profiling with xdebug as @dkarlovi suggested? Were you able to pinpoint something specific that would consume especially a lot of memory? There is #648 that might help you. We need somebody to confirm the memory improvement and create a proper PR to merge it. Would one of you be able to help? |
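A related, documented reader option worth testing alongside that experiment: setReadDataOnly() tells the reader to load cell values only and skip formatting information.
$reader = IOFactory::createReader('Xlsx');
// Load cell data only, skipping styles/formatting.
$reader->setReadDataOnly(true);
$spreadsheet = $reader->load($inputFileName);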
Hello @PowerKiKi, thank you for the input. We did some memory profiling, yes, but at this volume it is somewhat hard to point any fingers. The biggest memory-consuming process reported by xdebug is caused by the preg_match within the static coordinateFromString method, which is weird. We also looked at other libraries, including box/spout, which is very promising due to its streaming mechanism - but it suffers the same problem, memory leaks with xlsx, however not at this scale.
We believe that there is a memory leak in the compiled libxml version. With PHP 7.2, the compiled libxml version is 2.9.1, while on xmlsoft.org the latest version is 2.9.7, released Nov 02 2017. If you look at the change log for keyword "memory" between the 2 versions you'll see over 25 fixes! We will attempt to compile our PHP instance with the latest libxml version and see if this makes a notable difference. Then we'll have to choose the best tool for each need (csv, xls, xlsx). |
This is not weird at all; the functionality is used for each cell as it's being read. I had the same problem. |
If that preg_match can indeed be confirmed to be a bottleneck, it could perhaps be replaced by a very simple parsing process instead. Since the pattern is actually quite simple, the parsing code would be rather straightforward and could maybe save time and/or memory? Anyway, as you know, PhpSpreadsheet is not known for its speed or memory consumption, so every effort to improve the situation would be very welcome. @pop-mihai you seem to be in a good spot for that, with a strong use-case and good knowledge of the overall issues. Don't hesitate to experiment with PhpSpreadsheet's own code and see if it could be merged back in a second step... |
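As an illustration only, a regex-free split could look roughly like this (hypothetical helper name; it assumes plain "A1"-style references without sheet names or "$" absolute markers):
function coordinateFromStringFast(string $coordinate): array
{
    $column = '';
    $row = '';
    $length = strlen($coordinate);
    for ($i = 0; $i < $length; ++$i) {
        $char = $coordinate[$i];
        if ($char >= 'A' && $char <= 'Z') {
            $column .= $char;
        } else {
            // Everything from the first non-letter onwards must be the row number.
            $row = substr($coordinate, $i);
            break;
        }
    }
    if ($column === '' || $row === '' || !ctype_digit($row)) {
        throw new InvalidArgumentException('Invalid cell coordinate: ' . $coordinate);
    }
    return [$column, $row];
}
Whether this actually beats the existing preg_match in practice would need to be confirmed by profiling.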
@PowerKiKi I actually started working on this a bit for my use case, I've switched from See sigwinhq/xezilaires@e69afc4#diff-e11287c474a458a580cd0b18582ae5a8R236 |
Does your spreadsheet have many calculated cells? I can import a sheet with around ~200k cells within a few seconds, however a sheet with ~600k cells where many of them are calculated times out after 5 minutes. Here are the major culprits in my profile. Note that it did time out, so it probably could have gotten worse? EDIT: sorry, not thinking straight. It timed out because of xdebug - but my point was that there's definitely a high cost if your sheet has a lot of calculated cells. |
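If the evaluated results aren't actually needed, one thing worth checking (my suggestion, not something confirmed in this thread) is whether the calculation engine is being triggered, since that is where the cost of formula cells shows up:
$sheet = $spreadsheet->getActiveSheet();
// toArray() evaluates every formula by default; passing false for $calculateFormulas
// skips the calculation engine and returns raw cell contents instead
// (formula cells then come back as "=..." strings).
$rows = $sheet->toArray(null, false, false);
// Per cell: getValue() returns the raw content, getCalculatedValue() runs the engine.
$raw = $sheet->getCell('A1')->getValue();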
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
This worked for me too. By storing all of the cells in a single cache file and keeping an in-memory index, only 25% of the memory is used, and it isn't noticeably slower than keeping everything in memory (it's actually sometimes faster, I'm not sure why). The length of the generated cache prefix doesn't help memory usage, since it ends up requiring > 39 bytes per cache key. |
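A rough sketch of that idea, not the commenter's actual class: cell payloads are appended to one cache file and only a small key-to-offset index is kept in memory. It assumes psr/simple-cache ^1.0 (no return type declarations on the interface), ignores TTLs, and never reclaims space for overwritten keys:
use PhpOffice\PhpSpreadsheet\Settings;
use Psr\SimpleCache\CacheInterface;

class SingleFileCellCache implements CacheInterface
{
    private $handle;     // handle of the single cache file
    private $index = []; // in-memory index: key => [byte offset, length]

    public function __construct(string $path)
    {
        $this->handle = fopen($path, 'c+b');
    }

    public function get($key, $default = null)
    {
        if (!isset($this->index[$key])) {
            return $default;
        }
        [$offset, $length] = $this->index[$key];
        fseek($this->handle, $offset);
        return unserialize(fread($this->handle, $length));
    }

    public function set($key, $value, $ttl = null)
    {
        // Always append; an overwritten key just leaves dead bytes behind.
        fseek($this->handle, 0, SEEK_END);
        $payload = serialize($value);
        $this->index[$key] = [ftell($this->handle), strlen($payload)];
        fwrite($this->handle, $payload);
        return true;
    }

    public function delete($key) { unset($this->index[$key]); return true; }
    public function has($key) { return isset($this->index[$key]); }
    public function clear() { $this->index = []; ftruncate($this->handle, 0); return true; }

    public function getMultiple($keys, $default = null)
    {
        $result = [];
        foreach ($keys as $key) { $result[$key] = $this->get($key, $default); }
        return $result;
    }

    public function setMultiple($values, $ttl = null)
    {
        foreach ($values as $key => $value) { $this->set($key, $value); }
        return true;
    }

    public function deleteMultiple($keys)
    {
        foreach ($keys as $key) { $this->delete($key); }
        return true;
    }
}

// Register the cache before loading the workbook.
Settings::setCache(new SingleFileCellCache('/tmp/phpspreadsheet-cells.bin'));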
After spending some time profiling this I'm fairly certain it's not the [...]. If you look at the profile linked to #823 it shows the memory usage of [...]. I tried removing the call to [...]. If you rewrite [...]. A good percentage of the memory usage being blamed on [...] |
@ryzr how did you track the time in xdebug like in the screenshots? Can you please show me how?
And also about the issue: how can I reduce the time for reading Excel files? It takes minutes to open a 4 MB file with no images in it. |
Hi, hope this helps in finding the problem, |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
I'm finding https://github.com/box/spout to be a good way for reading XLSX files in a performant manner. |
@danielbachhuber - yes, box/spout, due to its streaming mechanism and use of yielding, is very efficient, but it will only work for XLSX and CSV. For XLS we couldn't find any alternative (or at least none with a feature list comparable to what we are looking for), so we continue to use PhpSpreadsheet in those cases. Here we managed to lower the memory footprint only by reading in smaller chunks. All attempts to use cell caching, force freeing of memory, etc. have either failed or moved the problem from one side to another. |
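For reference, the streaming read being described looks roughly like this with box/spout 3.x ($inputFileName as in the earlier snippets; the class names have changed in the openspout fork, so check the current documentation):
use Box\Spout\Reader\Common\Creator\ReaderEntityFactory;

$reader = ReaderEntityFactory::createXLSXReader();
$reader->open($inputFileName);
foreach ($reader->getSheetIterator() as $sheet) {
    foreach ($sheet->getRowIterator() as $row) {
        // Rows are yielded one at a time, so memory usage stays flat.
        $cells = $row->toArray();
        // ... process $cells ...
    }
}
$reader->close();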
Hi, I am still new to this but tried out a solution which helped us here:
// Define how many rows we want to read for each "chunk"
$chunkSize = 1000;
// Loop to read our worksheet in "chunk size" blocks
for ($startRow = 1; $startRow <= $rawRows; $startRow += $chunkSize) {
// Create a new Reader of the type defined in $inputFileType
$reader = IOFactory::createReader($inputFileType);
// Create a new Instance of our Read Filter
$chunkFilter = new Chunk();
// Tell the Reader that we want to use the Read Filter that we've Instantiated
$reader->setReadFilter($chunkFilter);
// Tell the Read Filter, the limits on which rows we want to read this iteration
$chunkFilter->setRows($startRow, $chunkSize);
// Load only the rows that match our filter from $inputFileName to a PhpSpreadsheet Object
$spreadsheet = $reader->load($inputFileName);
.....
// process the file
.....
// then release the memory
$spreadsheet->__destruct();
$spreadsheet = null;
unset($spreadsheet);
$reader->__destruct();
$reader = null;
unset($reader);
}
This helps large sheets to use only the memory of one chunk and never exceed the memory limit. |
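For what it's worth, the documented way to release a loaded workbook is disconnectWorksheets(), which breaks the internal spreadsheet/worksheet circular references, rather than calling __destruct() by hand:
// Break the circular references so PHP can actually reclaim the memory.
$spreadsheet->disconnectWorksheets();
unset($spreadsheet);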
helped me, thanks! |
What about $rawRows? How can I get it? |
Hi there, how do I see this info like in these screenshots? Thanks |
@cuongngoz The screenshots are from Xdebug. Depending on your OS + editor, you can simply google "Setting up Xdebug on [your OS] with [your editor]". Here is a sample link |
Thanks @shubhamt619, I have already set up xdebug with my editor (PhpStorm, actually), but I just don't know how to see the code timing analysis as in your screenshot; it looks useful, btw. |
$rawRows : what does this correspond to? |
Thank you for solving my problem |
Why recreate a new $reader and $chunkFilter in each loop iteration? |
@edeiller-gfi I tested my program and they can be put outside the loop.
output:
|
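A sketch of that variant, reusing the reader and filter across iterations (variable names follow the earlier snippets, and the cleanup uses the documented disconnectWorksheets() call):
$reader = IOFactory::createReader($inputFileType);
$chunkFilter = new Chunk();
$reader->setReadFilter($chunkFilter);
for ($startRow = 1; $startRow <= $rawRows; $startRow += $chunkSize) {
    // Only the row window changes between iterations.
    $chunkFilter->setRows($startRow, $chunkSize);
    $spreadsheet = $reader->load($inputFileName);
    // ... process the chunk ...
    $spreadsheet->disconnectWorksheets();
    unset($spreadsheet);
}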
In my opinion this issue should be re-opened.
What needs to be addressed is the huge speed regression between 1.29 and 2.2. I will still use PhpSpreadsheet for writing, but reading needs to be faster to be usable in production. The getCell() call is definitely the culprit in that situation. |
If you are looking for a different way to read documents, the spout project is now here: https://github.com/openspout/openspout |
PR #4153 has made some significant speed improvements. If you can test against master, please do so. If not, I expect it to be part of a formal release within the next few weeks. There are also some useful suggestions about speeding things up in the discussion of that PR. |
This is: a feature request