
#1849: Hash each digest algo in separated threads #1850

Merged · 5 commits · Oct 11, 2023

Conversation

aberenguel (Contributor)

Implements #1849
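For readers skimming the thread, the core idea of #1849, computing each enabled digest algorithm in its own thread instead of sequentially, can be sketched roughly as below. This is a hypothetical illustration in Python (IPED itself is Java); `hash_parallel` and the algorithm set are assumptions for the example, not the PR's actual code.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Example set of algorithms; IPED's enabled set is configurable.
ALGORITHMS = ["md5", "sha1", "sha256"]

def hash_parallel(data: bytes) -> dict[str, str]:
    """Compute all digests of the same buffer, one thread per algorithm."""
    def digest(algo: str) -> tuple[str, str]:
        h = hashlib.new(algo)
        h.update(data)
        return algo, h.hexdigest()

    # All threads read the same immutable buffer, so no locking is needed.
    with ThreadPoolExecutor(max_workers=len(ALGORITHMS)) as pool:
        return dict(pool.map(digest, ALGORITHMS))
```

The win comes when a single large file would otherwise keep one worker busy computing several digests back to back.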

@lfcnassif lfcnassif linked an issue Aug 30, 2023 that may be closed by this pull request
@lfcnassif (Member)

Thank you @aberenguel! Just pushed a small change to use fewer threads when possible, with the same upper bound. I'll run a small test and a medium-size one to see how performance behaves.
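The tweak described above ("fewer threads when possible, with the same upper bound") amounts to sizing the pool by the number of enabled algorithms, capped at a fixed limit. A minimal sketch, where `pool_size` and the value of `MAX_HASH_THREADS` are assumptions, not the PR's actual constants:

```python
# Assumed upper bound for illustration only; the PR's real cap is not stated here.
MAX_HASH_THREADS = 4

def pool_size(num_algorithms: int) -> int:
    # Use fewer threads when few algorithms are enabled,
    # but never exceed the upper bound.
    return min(num_algorithms, MAX_HASH_THREADS)
```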

@aberenguel (Contributor, Author)

Great! Thanks, @lfcnassif !
I had misunderstood how Tasks are instantiated and finished. I thought they were instantiated per item, but I've seen that they are instantiated per worker.

Now I think it is very good!

@aberenguel (Contributor, Author)

I ran it on a small case (500k items) and haven't seen any significant difference.
Now I'll try a bigger one.

@lfcnassif (Member) commented Aug 31, 2023

My small case (250K items) test result was the same as yours, no significant difference.

But the medium-size test (2.2M items) result was not good... Total processing time increased from ~6,600s to ~8,600s. The evidence was on the local network, so the test could have been affected by other storage or network activity, although it was run after midnight... I'll copy the evidence locally and repeat the test.

Let's see how your test behaves.

@lfcnassif (Member) commented Aug 31, 2023

Another thought, ideally we should also test this on a commodity computer with less CPU power, not every user has a machine like ours...

@lfcnassif (Member)

Just finished the test with the 2.2M items evidence (750GB) processed locally. Results seem fine:

  • HashTask time decreased from 635s to 545s
  • Total processing time decreased just a bit from 6652s to 6635s.

@aberenguel (Contributor, Author)

Great! I think the biggest difference will appear when a big file is being hashed while other workers are idle.

@lfcnassif (Member)

This test evidence is very mixed, with files of different kinds. There is a 40GB VDI inside it. Of course other cases can benefit much more from this optimization. My concern was about performance loss in some cases, but it seems we shouldn't worry about it.

@lfcnassif (Member)

Hi @aberenguel, did you have time to run this on your larger case? If it seems good, I think we can merge this.

@aberenguel (Contributor, Author)

> Hi @aberenguel, did you have time to run this on your larger case? If it seems good, I think we can merge this.

I am processing it again. The previous run was not reliable because I was doing other things on the computer. Now the test is scripted. I'll post the results tomorrow.

@aberenguel (Contributor, Author)

On my personal computer I processed a folder with 1.1M files, running the default profile twice.

| Profile | master (f6eae92) [seconds] | hashparallel (merged) [seconds] |
| --- | --- | --- |
| default | 1697 / 1742 | 1716 / 1810 |
| forensic | 2707 | 2678 |

@lfcnassif (Member)

Great! Seems fine to me; the differences look like normal fluctuations. If that was your larger (last) test and no one objects, I'll merge this soon.

@aberenguel (Contributor, Author) commented Sep 11, 2023

I ran it again on a case with 8,354,170 items (560,222 MB).
It seems it got worse.

| master | hashparallel |
| --- | --- |
| 03:45:55 | 04:00:47 |

I'm running the test one last time.

@aberenguel (Contributor, Author)

Another test on an E01 file (a 160GB disk image), with the forensic profile and OCR enabled.

master -> 09h 27m 26s
hash_parallel -> 09h 25m 16s

@lfcnassif (Member)

Great, thank you! Have you had a chance to repeat the previous test that got a bit worse?

@lfcnassif (Member)

> Another test on an E01 file (a 160GB disk image), with the forensic profile and OCR enabled.
>
> master -> 09h 27m 26s; hash_parallel -> 09h 25m 16s

With OCR enabled there is more pressure on the CPU, so I think this is a good scenario to test the extra thread concurrency, and the results were fine. @aberenguel, can I merge this, or do you intend to repeat the previous test that got a bit worse?

@lfcnassif (Member)

Given the previous tests, I'm going to merge this. If anyone thinks this can cause a significant performance regression in some processing scenario and has any evidence of that, please let me know.

@lfcnassif (Member) left a comment

Thank you @aberenguel!

@lfcnassif lfcnassif merged commit 883b72a into sepinf-inc:master Oct 11, 2023
@aberenguel aberenguel deleted the hash_parallel branch October 4, 2024 21:52
Linked issue: Optimize HashTask using parallelism (#1849)