[feat] - additional buffer pool #2829

ahrav · 2024-05-10T21:13:36Z

Description:

This PR introduces a second buffer pool alongside the existing one to optimize the handling of different content types when dealing with archives. The original buffer pool, with a default size of 4kB, will be used for diff content, while the newly added buffer pool will have a larger buffer size of 64kB and will be dedicated to file content.

This change is based on tests conducted across ~120k repositories, which showed that most files handled during archive operations exceeded 4kB. Consequently, there was a significant increase in operations needed to expand the underlying buffer. By maintaining separate buffer pools for diff content and file content, we can reduce the overhead associated with constructing BufferedFileWriters and improve overall performance.

Checklist:

Tests passing (make test-community)?
Lint passing (make lint this requires golangci-lint)?

pkg/buffers/buffer/buffer.go

rosecodym · 2024-05-13T15:34:20Z

pkg/buffers/pool/pool.go

 // Pool of buffers.
 type Pool struct {
 	*sync.Pool
-	bufferSize uint32
+	bufferSize int


Why did you change this?

The main reason is convenience. Using an int instead of a uint32 eliminates the need to cast in all the places where it's used.

pkg/writers/buffer_writer/bufferwriter.go

pkg/writers/buffered_file_writer/bufferedfilewriter.go

rosecodym · 2024-05-13T15:48:16Z

pkg/writers/buffered_file_writer/bufferedfilewriter.go

 	return w
 }

 // NewFromReader creates a new instance of BufferedFileWriter and writes the content from the provided reader to the writer.
 func NewFromReader(r io.Reader, opts ...Option) (*BufferedFileWriter, error) {
+	opts = append(opts, WithBufferSize(Large))


If I'm reading this right, you only ever use the explicit option to use the large buffer pool, and get the default one via default method calls. I think this is a little awkward in a way that could be remediated by making the option a boolean called something like WithUseLargeBuffer. Did you consider something like that?

Yes, I did consider that approach. I thought making it configurable might be appropriate, anticipating additional sizes for future use cases. However, I may have jumped the gun and overcomplicated the solution.

CLAassistant · 2024-05-16T16:25:18Z

All committers have signed the CLA.

…-pool

* Remove specialized handler and archive struct and restructure handlers pkg. * Refactor RPM archive handlers to use a library instead of shelling out * make rpm handling context aware * update test * Refactor AR/deb archive handler to use an existing library instead of shelling out * Update tests * Handle non-archive data within the DefaultHandler * make structs and methods private * Remove non-archive data handling within sources * add max size check * add filename and size to context kvp * move skip file check and is binary check before opening file * fix test * preserve existing funcitonality of not handling non-archive files in HandleFile * Handle non-archive data within the DefaultHandler * rebase * Remove non-archive data handling within sources * Adjust check for rpm/deb archive type * add additional deb mime type * add gzip * move diskbuffered rereader setup into handler pkg * remove DiskBuffereReader creation logic within sources * update comment * move rewind closer * reduce log verbosity * add metrics for file handling * add metrics for errors * make defaultBufferSize a const * add metrics for file handling * add metrics for errors * fix tests * add metrics for max archive depth and skipped files * update error * skip symlinks and dirs * update err * Address incompatible reader to openArchive * remove nil check * fix err assignment * Allow git cat-file blob to complete before trying to handle the file * wrap compReader with DiskbufferReader * Allow git cat-file blob to complete before trying to handle the file * updates * use buffer writer * update * refactor * update context pkg * revert stuff * update test * fix test * remove * use correct reader * add metrics for file handling * add metrics for errors * fix tests * rebase * add metrics for errors * add metrics for max archive depth and skipped files * update error * skip symlinks and dirs * update err * fix err assignment * rebase * remove * Update write method in contentWriter interface * Add bufferReadSeekCloser * update name * update comment * fix lint * Remove specialized handler and archive struct and restructure handlers pkg. * Refactor RPM archive handlers to use a library instead of shelling out * make rpm handling context aware * update test * Refactor AR/deb archive handler to use an existing library instead of shelling out * Update tests * add max size check * add filename and size to context kvp * move skip file check and is binary check before opening file * fix test * preserve existing funcitonality of not handling non-archive files in HandleFile * Handle non-archive data within the DefaultHandler * rebase * Remove non-archive data handling within sources * Handle non-archive data within the DefaultHandler * add gzip * move diskbuffered rereader setup into handler pkg * remove DiskBuffereReader creation logic within sources * update comment * move rewind closer * reduce log verbosity * make defaultBufferSize a const * add metrics for file handling * add metrics for errors * fix tests * add metrics for max archive depth and skipped files * update error * skip symlinks and dirs * update err * Address incompatible reader to openArchive * remove nil check * fix err assignment * wrap compReader with DiskbufferReader * Allow git cat-file blob to complete before trying to handle the file * updates * use buffer writer * update * refactor * update context pkg * revert stuff * update test * remove * rebase * go mod tidy * lint check * update metric to ms * update metric * update comments * dont use ptr * update * fix * Remove specialized handler and archive struct and restructure handlers pkg. * Refactor RPM archive handlers to use a library instead of shelling out * make rpm handling context aware * update test * Refactor AR/deb archive handler to use an existing library instead of shelling out * Update tests * add max size check * add filename and size to context kvp * move skip file check and is binary check before opening file * fix test * preserve existing funcitonality of not handling non-archive files in HandleFile * Adjust check for rpm/deb archive type * add additional deb mime type * update comment * go mod tidy * update go mod * Add a buffered file reader * update comments * use Buffered File Readder * return buffer * update * fix * return * go mod tidy * merge * use a shared pool * use sync.Once * reorganzie * remove unused code * fix double init * fix stuff * nil check * reduce allocations * updates * update metrics * updates * reset buffer instead of putting it back * skip binaries * skip * concurrently process diffs * close chan * concurrently enumerate orgs * increase workers * ignore pbix and vsdx files * add metrics for gitparse's Diffchan * fix metric * update metrics * update * fix checks * fix * inc * update * reduce * Create workers to handle binary files * modify workers * updates * add check * delete code * use custom reader * rename struct * add nonarchive handler * fix break * add comments * add tests * refactor * remove log * do not scan rpm links * simplify * rename var * rename * fix benchmark * add buffer * buffer * buffer * handle panic * merge main * merge main * add recover * revert stuff * revert * revert to using reader * fixes * remove * update * fixes * linter * fix test * move buffers pkg out of writers pkg * rename * [refactor] - move buffer pool logic into own pkg (#2828) * move buffer pool logic into own pkg * fix test * fix test * whoops * [feat] - additional buffer pool (#2829) * move buffer pool logic into own pkg * move * fix test * fix test * fix test * remove * fix test * whoops * revert * fix

ahrav added 8 commits May 10, 2024 13:44

move buffer pool logic into own pkg

dea33d2

move

464bf31

fix test

37c5fc2

fix test

77addfb

update test

0d0a63c

fix test

ac1b90d

remove

967a002

fix test

03c6682

ahrav marked this pull request as ready for review May 10, 2024 22:31

ahrav requested a review from a team as a code owner May 10, 2024 22:31

ahrav requested a review from a team May 10, 2024 22:31

rosecodym reviewed May 13, 2024

View reviewed changes

ahrav and others added 2 commits May 13, 2024 14:05

whoops

c33ba5d

Merge branch 'refactor-pool-logic' into feat-additional-buffer-pool

7efeefb

Base automatically changed from refactor-pool-logic to chore-relocate-buffers-pkg May 13, 2024 21:32

ahrav added 2 commits May 13, 2024 14:40

merge base

7dbcb54

revert

f6683ff

ahrav requested a review from rosecodym May 13, 2024 21:44

merge main

93ac4a8

ahrav requested review from a team May 16, 2024 17:12

Merge branch 'chore-relocate-buffers-pkg' into feat-additional-buffer…

3106235

…-pool

dustin-decker approved these changes May 16, 2024

View reviewed changes

ahrav merged commit 5e3d660 into chore-relocate-buffers-pkg May 16, 2024
9 of 10 checks passed

ahrav deleted the feat-additional-buffer-pool branch May 16, 2024 21:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[feat] - additional buffer pool #2829

[feat] - additional buffer pool #2829

ahrav commented May 10, 2024 •

edited

Loading

rosecodym May 13, 2024

ahrav May 13, 2024

rosecodym May 13, 2024

ahrav May 13, 2024

CLAassistant commented May 16, 2024 •

edited

Loading

[feat] - additional buffer pool #2829

[feat] - additional buffer pool #2829

Conversation

ahrav commented May 10, 2024 • edited Loading

Description:

Checklist:

rosecodym May 13, 2024

Choose a reason for hiding this comment

ahrav May 13, 2024

Choose a reason for hiding this comment

rosecodym May 13, 2024

Choose a reason for hiding this comment

ahrav May 13, 2024

Choose a reason for hiding this comment

CLAassistant commented May 16, 2024 • edited Loading

ahrav commented May 10, 2024 •

edited

Loading

CLAassistant commented May 16, 2024 •

edited

Loading