
De-duplicate images #285

Closed
andrewvc opened this issue May 24, 2021 · 6 comments · Fixed by #290
Assignees
Labels
refined Issue refined, ready to work on v7.14.0

Comments

@andrewvc
Contributor

Currently, screenshots take up too much storage space in synthetics. While there are many approaches to improving this, this issue covers our initial approach, which will be de-duplication of images, or rather of sections of images that repeat.

This is accomplished via a content-addressable storage scheme: we take each image, slice it into n parts, hash each part, and use that hash as an Elasticsearch document ID. We can then describe each image as a series of image-processing operations compositing these parts into a final canvas, either on the client or the server side.

See this PR: #282
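As an illustration of the scheme described above, here is a minimal Python sketch (hypothetical; the actual synthetics agent is TypeScript and the function names here are invented): slice a screenshot buffer into a grid of chunks, hash each chunk, and use the hash as the Elasticsearch document `_id`, so identical chunks collapse into a single stored document.

```python
import hashlib

def slice_and_hash(pixels: bytes, rows: int = 8, cols: int = 8):
    """Split raw screenshot bytes into rows*cols chunks and hash each one.

    The hex digest doubles as the content-addressable Elasticsearch _id:
    two identical chunks always map to the same document.
    """
    n = rows * cols
    size = len(pixels) // n
    blocks = []
    for i in range(n):
        chunk = pixels[i * size:(i + 1) * size]
        blocks.append({"_id": hashlib.sha256(chunk).hexdigest(), "bytes": chunk})
    return blocks

def image_ref(blocks):
    """An ImageRef-style descriptor: the ordered list of block hashes
    needed to recomposite the full screenshot client- or server-side."""
    return [b["_id"] for b in blocks]
```

For brevity this slices the raw byte stream; the real implementation slices the decoded image into a 2D grid of tiles before hashing.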

@vigneshshanmugam
Member

Did some quick benchmarking today with our current approach. It looks promising; we should definitely do this and explore more options without killing ES.

Benchmark results

Scenario

In all the results below, both the todos and elastic homepage journeys are run:

  • Todos journeys - all tests under our example/todos directory; no. of runs: 10
  • Elastic inline journey - go to the homepage and hover over products; no. of runs: 25

Results

Index size based on the three tests:

  • Dedupe based on screenshot blocks - based on De-dupe screenshots #282
  • Current master
  • Dedupe based on complete image buffer - compute a hash of the full screenshot buffer and overwrite the existing document if that ID is already present. Close to the dedupe PR, but uses the complete screenshot instead of blocks.
| Tests | Dedupe blocks | Dedupe on complete buffer | Master |
| --- | --- | --- | --- |
| Elastic inline journey | 1 MB | 2.9 MB | 3 MB |
| Todos journey | 580 KB | 326 KB | 1.2 MB |
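For clarity, here is a hedged sketch of the whole-buffer variant described in the third bullet (illustrative only; `screenshot_doc` is not a real agent function): one hash per screenshot, so re-indexing an unchanged screenshot overwrites the same document instead of storing a new copy, while any pixel change produces a brand-new ID.

```python
import hashlib

def screenshot_doc(buffer: bytes) -> dict:
    """Whole-buffer dedupe: hash the complete screenshot and use it as
    the document _id. Identical runs overwrite the same document; any
    pixel change (e.g. an animation frame) yields a new _id, so no
    storage is saved for sites that vary between runs."""
    return {"_id": hashlib.sha256(buffer).hexdigest(), "screenshot": buffer}
```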

Summary

  • One issue I found with our current deduplication approach based on De-dupe screenshots #282 is that it puts a lot of pressure on ES, since documents are updated at a constant rate; the main process was killed in our todos example. We might need to run more tests with an increased block size to see if that improves the situation.

  • As a first step, we could also dedupe on the complete buffer, which does not have the same issue as the current PR, but also won't yield massive improvements if the website has animations and the screenshot changes between runs. E.g. the inline journeys did not improve much because the elastic website has a small animation that changes the hash on each run; todos, however, produces the same image every run and so compresses very well.

@andrewvc
Contributor Author

Great work, amazing to see a 3X improvement on the elastic.co website! I can only imagine that the savings increase the more runs are done.

Can you say more about the pressure on ES? One simple optimization we have not yet done is to use the _create API, which should make indexing any duplicate block a no-op. Additionally, within a journey we can track block hashes and only output block objects for hashes that have not yet been seen.

It looks like de-duping on the complete buffer fared very poorly on the elastic.co website. That doesn't seem to be a viable approach if it depends on customers having very simple sites that don't vary over time, as with the todos journey.

It makes sense to stay with blocks because they give us additional avenues for optimization as well. For instance, we could do visual diffs between images to further compress individual blocks; think I-frames in video processing. As long as the ImageRef objects are essentially lists of compositing operations, we have a lot of room to optimize things on the agent side.
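The two optimizations suggested here could look roughly like this (a sketch under assumptions: `emit_blocks` and the bulk-action shape are illustrative, but the create op type is the real Elasticsearch mechanism that rejects writes to an existing `_id` instead of updating it):

```python
def emit_blocks(blocks: list, seen: set) -> list:
    """Journey-level dedupe plus create-only indexing.

    - `seen` tracks block hashes already emitted within this journey,
      so repeated blocks are skipped entirely on the agent side.
    - Each emitted action uses the bulk "create" op type, so if another
      run already indexed the same _id, ES returns a 409 conflict and
      the write is effectively a no-op rather than a document update.
    """
    actions = []
    for block in blocks:
        if block["_id"] in seen:
            continue  # already emitted earlier in this journey
        seen.add(block["_id"])
        actions.append({"create": {"_index": "screenshot-blocks", "_id": block["_id"]}})
        actions.append({"bytes": block["bytes"]})
    return actions
```

The index name `screenshot-blocks` is a placeholder, not the actual index used by synthetics.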

@andrewvc
Contributor Author

I'll add that there are a lot of avenues to pursue here in terms of optimization, but I think the next step is to get a solid implementation down. We can iterate and improve things in the future.

@vigneshshanmugam
Member

vigneshshanmugam commented May 24, 2021

It looks like de-duping on the complete buffer fared very poorly on the elastic.co website. That doesn't seem to be a viable approach if it depends on customers having very simple sites that don't vary over time, as with the todos journey.

++ to your points. It was just for educational purposes, to see how well it performs. Ideally I don't expect users to have the exact same screenshot on each and every run.

Can you say more about the pressure on ES? One simple optimization we have not yet done is to use the _create API which should make any duplicate blocks a noop.

When the benchmark ran, the ES container was killed as the ES process became unhealthy, and queries took on the order of >20 seconds even for a small match query. I believe it's due to constantly updating the underlying documents; it may be worth checking again with the _create document approach you mentioned during our sync.

It makes sense to stay with blocks because they give us additional avenues for optimization as well. For instance, we could do visual diffs between images to further compress individual blocks.

💯 Let's iterate on our current approach and also figure out the optimal block size with more benchmarks.

@vigneshshanmugam
Member

Ran the benchmarks with the _create operation as per elastic/beats#25808 - more details: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html

I did not encounter any ES pressure this time. Still unsure whether the previous default indexing strategy was the cause.

Results

Todos journey, 10 runs - 481 KB, 100 KB less than the previous run.

@vigneshshanmugam
Member

vigneshshanmugam commented May 25, 2021

Tried changing the block size to 16, which results in 256 blocks per image (1280×720); the results were not that great (todos journey, 10 runs - 1.2 MB). Let's stick with 8 as per the PR, which yields 64 blocks per image.
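For reference, the arithmetic behind those block counts (a trivial sketch; the `grid_stats` helper is hypothetical): a 16×16 grid gives 256 blocks of 80×45 px on a 1280×720 screenshot, while the 8×8 grid in the PR gives 64 blocks of 160×90 px.

```python
def grid_stats(width: int, height: int, per_side: int):
    """Block count and per-block pixel size for a per_side x per_side grid."""
    return per_side ** 2, width // per_side, height // per_side
```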
