
De-duplicate images #285

Closed
andrewvc opened this issue May 24, 2021 · 6 comments · Fixed by #290
Assignees
Labels
refined Issue refined, ready to work on v7.14.0

Comments

@andrewvc
Contributor

Currently, screenshots take up too much storage space in synthetics. While there are many approaches to improving this, this issue covers our initial approach, which will be de-duplication of images, or rather of sections of images that repeat.

This is accomplished via a content-addressable storage scheme: we take each image, slice it into n parts, hash each part, and use that hash as an Elasticsearch document ID. We can then describe each image as a series of image-processing operations compositing these parts into a final canvas, either on the client or the server side.

See this PR: #282
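As an illustration of the scheme described above, here is a minimal Python sketch (hypothetical; the actual synthetics agent is TypeScript and the function names here are invented): slice a screenshot buffer into a grid of chunks, hash each chunk, and use the hash as the Elasticsearch document `_id`, so identical chunks collapse into a single stored document.

```python
import hashlib

def slice_and_hash(pixels: bytes, rows: int = 8, cols: int = 8):
    """Split raw screenshot bytes into rows*cols chunks and hash each one.

    The hex digest doubles as the content-addressable Elasticsearch _id:
    two identical chunks always map to the same document.
    """
    n = rows * cols
    size = len(pixels) // n
    blocks = []
    for i in range(n):
        chunk = pixels[i * size:(i + 1) * size]
        blocks.append({"_id": hashlib.sha256(chunk).hexdigest(), "bytes": chunk})
    return blocks

def image_ref(blocks):
    """An ImageRef-style descriptor: the ordered list of block hashes
    needed to recomposite the full screenshot client- or server-side."""
    return [b["_id"] for b in blocks]
```

For brevity this slices the raw byte stream; the real implementation slices the decoded image into a 2D grid of tiles before hashing.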

@vigneshshanmugam
Member

Did some quick benchmarking today with our current approach. It looks promising; we should definitely do this and explore more options without killing ES.

Benchmark results

Scenario

In all the results below, both the todos and elastic homepage journeys are run:

  • Todos journeys - all tests under our example/todos directory; no. of runs: 10
  • Elastic inline journey - go to the homepage and hover over products; no. of runs: 25

Results

Index size based on the three tests:

  • Dedupe based on screenshot blocks - based on De-dupe screenshots #282
  • Current master
  • Dedupe based on complete image buffer - compute a hash of the full screenshot buffer and overwrite the existing document if that ID is already present. Close to the dedupe PR, but uses the complete screenshot instead of blocks.
| Tests | Dedupe blocks | Dedupe on complete buffer | Master |
| --- | --- | --- | --- |
| Elastic inline journey | 1 MB | 2.9 MB | 3 MB |
| Todos journey | 580 KB | 326 KB | 1.2 MB |
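For clarity, here is a hedged sketch of the whole-buffer variant described in the third bullet (illustrative only; `screenshot_doc` is not a real agent function): one hash per screenshot, so re-indexing an unchanged screenshot overwrites the same document instead of storing a new copy, while any pixel change produces a brand-new ID.

```python
import hashlib

def screenshot_doc(buffer: bytes) -> dict:
    """Whole-buffer dedupe: hash the complete screenshot and use it as
    the document _id. Identical runs overwrite the same document; any
    pixel change (e.g. an animation frame) yields a new _id, so no
    storage is saved for sites that vary between runs."""
    return {"_id": hashlib.sha256(buffer).hexdigest(), "screenshot": buffer}
```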

Summary

  • One issue I found with our current deduplication approach based on De-dupe screenshots #282 is that it puts a lot of pressure on ES, since documents are updated at a constant rate; the main process was killed in our todos example. We might need to run more tests with an increased block size to see if that improves the situation.

  • As a first step, we could also dedupe on the complete buffer, which does not have the same issue as the current PR, but also won't yield massive improvements if the website has animations and the screenshot changes between runs. E.g. the inline journeys did not improve much because the elastic website has a small animation that changes the hash on each run; todos, however, produces the same image every run and so compresses very well.

@andrewvc
Contributor Author

Great work, amazing to see a 3X improvement on the elastic.co website! I can only imagine that the savings increase the more runs are done.

Can you say more about the pressure on ES? One simple optimization we have not yet done is to use the _create API, which should make indexing any duplicate block a no-op. Additionally, within a journey we can track block hashes and only output block objects for hashes that have not yet been seen.

It looks like de-duping on the complete buffer fared very poorly on the elastic.co website. That doesn't seem to be a viable approach if it depends on customers having very simple sites that don't vary over time, as with the todos journey.

It makes sense to stay with blocks because they give us additional avenues for optimization as well. For instance, we could do visual diffs between images to further compress individual blocks; think I-frames in video processing. As long as the ImageRef objects are essentially lists of compositing operations, we have a lot of room to optimize things on the agent side.
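The two optimizations suggested here could look roughly like this (a sketch under assumptions: `emit_blocks` and the bulk-action shape are illustrative, but the create op type is the real Elasticsearch mechanism that rejects writes to an existing `_id` instead of updating it):

```python
def emit_blocks(blocks: list, seen: set) -> list:
    """Journey-level dedupe plus create-only indexing.

    - `seen` tracks block hashes already emitted within this journey,
      so repeated blocks are skipped entirely on the agent side.
    - Each emitted action uses the bulk "create" op type, so if another
      run already indexed the same _id, ES returns a 409 conflict and
      the write is effectively a no-op rather than a document update.
    """
    actions = []
    for block in blocks:
        if block["_id"] in seen:
            continue  # already emitted earlier in this journey
        seen.add(block["_id"])
        actions.append({"create": {"_index": "screenshot-blocks", "_id": block["_id"]}})
        actions.append({"bytes": block["bytes"]})
    return actions
```

The index name `screenshot-blocks` is a placeholder, not the actual index used by synthetics.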

@andrewvc
Contributor Author

I'll add that there are a lot of avenues to pursue here in terms of optimization, but I think the next step is to get a solid implementation down. We can iterate and improve things in the future.

@vigneshshanmugam
Member

vigneshshanmugam commented May 24, 2021

It looks like de-duping on the complete buffer fared very poorly on the elastic.co website. That doesn't seem to be a viable approach if it depends on customers having very simple sites that don't vary over time, as with the todos journey.

++ to your points. It was just for educational purposes, to see how well it performs. Ideally I don't expect users to have the exact same screenshot on each and every run.

Can you say more about the pressure on ES? One simple optimization we have not yet done is to use the _create API which should make any duplicate blocks a noop.

When the benchmark ran, the ES container was killed as the ES process became unhealthy, and queries took on the order of >20 seconds even for a small match query. I believe it's due to constantly updating the underlying documents; it may be worth checking again with the _create document approach you mentioned during our sync.

It makes sense to stay with blocks because they give us additional avenues for optimization as well. For instance, we could do visual diffs between images to further compress individual blocks.

💯 Let's iterate on our current approach and also figure out the optimal block size with more benchmarks.

@vigneshshanmugam
Member

Ran the benchmarks with the _create operation as per elastic/beats#25808 - more details: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html

I did not encounter any ES pressure this time. Still unsure whether the previous default indexing strategy was the cause.

Results

Todos journey, 10 runs - 481 KB, 100 KB less than the previous run.

@vigneshshanmugam
Member

vigneshshanmugam commented May 25, 2021

Tried changing the block size to 16, which results in 256 blocks per image (1280×720); the results were not that great (todos journey, 10 runs - 1.2 MB). Let's stick with 8 as per the PR, which yields 64 blocks per image.
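For reference, the arithmetic behind those block counts (a trivial sketch; the `grid_stats` helper is hypothetical): a 16×16 grid gives 256 blocks of 80×45 px on a 1280×720 screenshot, while the 8×8 grid in the PR gives 64 blocks of 160×90 px.

```python
def grid_stats(width: int, height: int, per_side: int):
    """Block count and per-block pixel size for a per_side x per_side grid."""
    return per_side ** 2, width // per_side, height // per_side
```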
