Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

high memory occupy when reading 100w lines excel file with latest master version #1096

Closed
aceshot opened this issue Dec 17, 2021 · 6 comments

Comments

@aceshot
Copy link

aceshot commented Dec 17, 2021

参考这个:https://github.com/qax-os/excelize/issues/845#issuecomment-922466641。

测试使用了master最新代码(支持临时文件和close版本),但是测试下来,75M左右的100w的Excel文件(单sheet单列 都是文本,64字节的256字符串),内存分配看着累计是有4G,最终持有400M,都挺高的。然后里面提到的93.7%降低,我这边似乎没体会到,是哪里使用有误?还是说因为excel自带很多元数据信息、自带压缩,所以75M文件加载内存就必然会有400M+,有优化空间?

2G内存的ECS测试机,读取过程中 内存立马就90%以上。业务逻辑上,因为行数比较多,for rows.Next()循环读取,外加一些计算处理,整个过程比较花时间。然后发现,读取&处理过程中,内存分配实际都是被进程占有和统计(在HeapInuse),最终体现在进程的RSS中,过程中持续会很高,然后触发告警 (然后内存的降低需要等整个文件读取处理完成 + GO GC释放返回OS)


1、本地测试代码:
增加了:
1)defer profile.Start(profile.MemProfile).Stop()
2)做总数统计

import (
	"fmt"
	"github.com/pkg/profile"
	"strconv"
	"testing"

	"github.com/xuri/excelize/v2"
)

func Test_excel_read_with_100w_lines(t *testing.T) {

	defer profile.Start(profile.MemProfile).Stop()

	f, err := excelize.OpenFile("resources/excel-one-column-with-sha256-100w-lines.xlsx")
	if err != nil {
		fmt.Println(err)
		return
	}
	rows, err := f.Rows(f.GetSheetList()[0])
	if err != nil {
		fmt.Println(err)
		return
	}
	var cnt int64 = 0
	for rows.Next() {
		_, err := rows.Columns()
		if err != nil {
			fmt.Println(err)
		}
		//for _, colCell := range row {
		//	fmt.Print(colCell, "\t")
		//}
		//fmt.Println()
		cnt++
	}
	fmt.Println("total size: "+strconv.FormatInt(cnt, 10))
	// Close the stream
	if err = rows.Close(); err != nil {
		fmt.Println(err)
	}
	// Close the workbook
	if err := f.Close(); err != nil {
		fmt.Println(err)
	}
}

2、 defer profile.Start(profile.MemProfile).Stop()输出pprof文件
inuse_space: 400M
image

allocate_space: 4G
image

@aceshot aceshot changed the title 依赖master代码读取100w的Excel表格,内存占用依旧非常高 依赖master代码读取100w行的Excel表格,内存占用依旧非常高 Dec 17, 2021
@aceshot aceshot changed the title 依赖master代码读取100w行的Excel表格,内存占用依旧非常高 high memory occupy when reading 100w lines excel file with latest master version Dec 17, 2021
@xuri
Copy link
Member

xuri commented Dec 18, 2021

Thanks for your feedback, the master branch code support unzip worksheet as temporary files to reduce memory usage, but some worksheets cell not using inline value, the cell value has been storage in the SST (shared string table), and Excelize doesn't support streaming reading/writing SST currently, the memory usage of reading some spreadsheet maybe still seem high, could you provides the attachemts without confidential info, and I can analytics and improvement in the future.

@aceshot
Copy link
Author

aceshot commented Dec 19, 2021

非常感谢作者能这么快回复哈!
1)测试文件是100w行的Excel文件,就单sheet单列。单列的值是一个sha256的字符串。测试文件我是用下面的程序先生成了txt,然后就导入到了excel中。

func Test_generate_sha256_excel(t *testing.T) {

	var len int64 = 1000000
	var startNum int64 = 18010000000

	fileRelativePath := "resources/excel-one-column-with-sha256-100w-lines.txt"

	var file *os.File
	var err error
	if _, err := os.Stat(fileRelativePath); os.IsNotExist(err) {
		//不存在 则创建
		file, err = os.Create(fileRelativePath)
	} else {
		//存在就覆盖写入
		file, err = os.OpenFile(fileRelativePath, os.O_CREATE|os.O_TRUNC|os.O_RDWR, fs.ModeAppend)
	}
	defer file.Close()
	if err != nil {
		panic(err)
	}

	for ; len > 0; len-- {
		num := strconv.FormatInt(startNum, 10)

		sha256 := sha256.Sum256([]byte(num))

		file.WriteString(hex.EncodeToString(sha256[:]) + "\n")

		startNum++
	}
}

2)『some worksheets cell not using inline value, the cell value has been storage in the SST (shared string table), and Excelize doesn't support streaming reading/writing SST』
也搜索了下,excel创建似乎默认使用的sst(包括 txt导入Excel,我的表格中就是类似如下的内容,单元格选择的是文本类型),没办法指定走inline value。应该是从压缩(降低大小)的角度考虑的。参考另一个:monitorjbl/excel-streaming-reader#35

1bd6abc4d671d2bae9126d0dbbd21c1ff7dfcac328424ebe7f4f2e1177002bb7
2f8f3b529f47732f75127c61817be9b5af111799b4b5a3cbd3ff126e46b44d12
d854f0d4f8206eca1c7c6232905189bd25fc4352f860ad1eaf8740caf78a4fa5

@xuri
Copy link
Member

xuri commented Dec 20, 2021

If you're using Excel and other spreadsheet applications, and when you create a spreadsheet with excelize common API, the cell value with string data types will be stored in the SST by default, but if you're using the streaming API to create a spreadsheet with the streaming writer, all cell value will be stored as inline. I'll consider improving reading performance on working with SST in the next.

@xuri xuri closed this as completed in 00386c7 Dec 27, 2021
xuri added a commit that referenced this issue Dec 27, 2021
- Unzip shared string table to system temporary file when large inner XML, reduce memory usage about 70%
- Remove unnecessary exported variable `XMLHeader`, we can using `encoding/xml` package's `xml.Header` instead of it
- Using constant instead of inline text for default XML path
- Rename exported option field `WorksheetUnzipMemLimit` to `UnzipXMLSizeLimit`
- Unit test and documentation updated
@xuri
Copy link
Member

xuri commented Dec 28, 2021

I have optimize memory usage, now support unzip shared string table to system temporary file when large inner XML, reduce memory usage about 60%, please try to upgrade to the master branch code, and this feature will be released in the next version.

@aceshot
Copy link
Author

aceshot commented Dec 29, 2021

I have optimize memory usage, now support unzip shared string table to system temporary file when large inner XML, reduce memory usage about 60%, please try to upgrade to the master branch code, and this feature will be released in the next version.

非常感谢,拉取了master代码 重新跑了下上 之前100w行的数据读,内存inused 从470M->175M 降低了很多 (分配的total数据 3.65G也降低了一些)

@xuri
Copy link
Member

xuri commented Jan 11, 2022

Hi @aceshot, I further reduce memory usage for reading SST by about 50%, note that time cost will increase by 30%, if you like please upgrade to the master branch code.

jenbonzhang pushed a commit to jenbonzhang/excelize that referenced this issue Oct 22, 2023
- Unzip shared string table to system temporary file when large inner XML, reduce memory usage about 70%
- Remove unnecessary exported variable `XMLHeader`, we can using `encoding/xml` package's `xml.Header` instead of it
- Using constant instead of inline text for default XML path
- Rename exported option field `WorksheetUnzipMemLimit` to `UnzipXMLSizeLimit`
- Unit test and documentation updated
jenbonzhang pushed a commit to jenbonzhang/excelize that referenced this issue Oct 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants