Parallel S3 store writing #130
Conversation
Codecov Report
@@ Coverage Diff @@
## master #130 +/- ##
==========================================
- Coverage 82.95% 82.33% -0.62%
==========================================
Files 26 26
Lines 1883 1891 +8
==========================================
- Hits 1562 1557 -5
- Misses 321 334 +13
Continue to review full report at Codecov. |
Hmmm, so my thought was a little different, because this still requires that whoever is using the store actually knows that it is an S3Store. That shouldn't be the case, and no builder should be written that way. Can you confirm whether the performance improvement comes simply from running multiple puts at the same time, or from compression, or something else? |
So I do observe the proper scaling with parallel writing like this. |
I was thinking about adding async_update here: maggma/src/maggma/cli/multiprocessing.py, line 207 in 9a8f5cc. But that's just replicating what process_item is already doing. |
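For reference, a minimal sketch of what such an async_update wrapper could look like, assuming the store exposes a blocking update(docs) method; the helper name and worker count here are illustrative, not maggma's actual API:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

async def async_update(store, docs, executor=None):
    # Off-load the blocking store.update call onto a worker thread so the
    # asyncio pipeline driving the builder is not blocked by S3 I/O.
    loop = asyncio.get_running_loop()
    executor = executor or ThreadPoolExecutor(max_workers=4)
    await loop.run_in_executor(executor, store.update, docs)
```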
I understand increasing the number of mrun processes speeds this up, but is the primary speedup because multiple CPUs are compressing, or because multiple S3 puts are running? |
So doing everything else the same. |
Try this: |
Has anyone tested the parallel scaling for mongo too? Is |
It takes about the same time to zip and to upload:
566 ms ± 4.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
308 ms ± 20.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) |
Those numbers don't look right. If that is the time cost of those operations, it shouldn't be 20 seconds per item. Are you sure you're using %%timeit correctly? I thought the command you were measuring had to be on the same line. |
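For reference, the two IPython forms differ: the line magic times only the statement on the same line, while the cell magic times the whole cell body below it. A quick illustration, with doc as a placeholder document:

```python
# In one Jupyter cell -- line magic, the timed statement sits on the same line:
%timeit zlib.compress(json.dumps(doc).encode())
```

```python
# In another cell -- cell magic on the first line, everything below it is timed:
%%timeit
blob = json.dumps(doc).encode()
zlib.compress(blob)
```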
Basically, I'm trying to determine what is causing the slowdown. If it is in fact compression, then multiprocessing will help. If it's not, and it's just IO, then there is a much more effective fix using thread pool executors:

```python
from concurrent.futures import ThreadPoolExecutor

# Set max_workers to the maximum number of concurrent S3 puts; this should be a Store variable.
pool = ThreadPoolExecutor(max_workers=16)
for key, doc in docs.items():
    pool.submit(test_s3.s3_bucket.put_object, Key=key, Body=doc)
```

This can be done in update. |
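A caveat worth noting: nothing in that snippet waits on the submitted puts or surfaces their exceptions before update returns. A minimal sketch of one way to do that, reusing the same hypothetical test_s3 and docs objects:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=16) as pool:
    futures = [
        pool.submit(test_s3.s3_bucket.put_object, Key=key, Body=doc)
        for key, doc in docs.items()
    ]
    for fut in as_completed(futures):
        fut.result()  # re-raises any exception from a failed put
```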
The 20 seconds is the time to go through a larger list of around 30 items. The time above is just for one item. |
For a bigger object, it's:
1.24 s ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
595 ms ± 70.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So it looks like compressing always takes about 2x as long as putting. |
Although using pickle does open up potential security flaws due to malicious code. What do you think @mkhorton? There is also ujson, but that still needs the jsanitize step. |
Never, ever use pickle as an archival serialization format. Maybe try the msgpack format instead? |
It's not just security, it's fragility too. The pickle format can and does break with updates to Python and updates to the classes that it's serializing. It's not as bad as it used to be, but still not advised. |
Wouldn't be surprised if it's the |
Yeah, msgpack seems to work well. I'm very reluctant to allow IO in |
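For concreteness, a rough sketch of the two serialization paths being weighed here, assuming plain msgpack plus zlib as the alternative; the numpy default= hook below is an illustrative workaround, not maggma's implementation:

```python
import json
import zlib

import msgpack
import numpy as np
from monty.json import jsanitize

doc = {"task_id": "mp-149", "charge_density": np.random.rand(10, 10)}

def np_default(obj):
    # Called by msgpack for types it cannot serialize natively.
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Cannot serialize {type(obj)}")

# Current path: jsanitize to strip numpy types, then JSON, then compress.
json_blob = zlib.compress(json.dumps(jsanitize(doc)).encode())

# Candidate path: msgpack with a numpy fallback, then compress.
msgpack_blob = zlib.compress(msgpack.packb(doc, default=np_default))
```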
Another option is the native NumPy .npy format: https://towardsdatascience.com/why-you-should-start-using-npy-file-more-often-df2a13cc0161. Though I think some versions of this use pickle under the hood, so I'm not sure. |
Those don't work for Python dictionaries though? This isn't meant for just one type of data; it needs to be general. |
Oh OK, sorry, I was thinking of charge densities specifically. |
So what exactly is jsanitize's role here? We control all of the data, so in principle this shouldn't be necessary(?) -- if it's not valid JSON, it seems correct that it should fail. Perhaps a better remedy is just to be stricter with the output of |
The purpose is really to take care of edge cases such as numpy arrays. |
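A small example of that edge case, with a made-up document:

```python
import json

import numpy as np
from monty.json import jsanitize

doc = {"task_id": "mp-149", "forces": np.zeros((2, 3))}

# json.dumps(doc) raises TypeError because the numpy array is not JSON serializable.
clean = jsanitize(doc)    # converts the array into nested plain Python lists
print(json.dumps(clean))  # now serializes cleanly
```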
Regardless of whether we are gonna use pickle
The speed scales with the number of processes up to 4, then tails off. Then, if we want to modify the results of process_item asynchronously, e.g. strip the data away and add an upload to S3, can that be done in this function? |
@jmmshn, the |
OK, sounds good, I'll do the ThreadPoolExecutor. On the formatting issue, it seems that the json step takes the same amount of time as the zlib step:
649 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
597 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I'll play with msgpack and pickle and maybe allow both in the aws store. |
I'd say we just drop |
Do you see a performance difference between |
WIP reworked aws store
This pull request introduces 1 alert when merging a4e136f into 6f70bc5 - view on LGTM.com. |
This pull request introduces 2 alerts when merging e8a6737 into 6f70bc5 - view on LGTM.com. |
No real difference and it breaks some tests since mock_s3 doesn't like it when you pass that around. |
Sounds good. Don't worry about the docs, I'll fix that. Just take off the WIP tag when you're ready. |
Done |
Moved the writing to S3 into its own function and allowed the upload to be skipped in the update function. (Tested, and I do get scaling on our current MinIO setup.)
The user can set that flag and call the S3 writing function in process_items to do parallel writing.
Also moved the s3_bucket initialization into the constructor; this seemed to affect speed too. @munrojm is also testing this out now for the BS migration and he's seeing big improvements.
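For readers skimming this later, a rough sketch of the shape described above; the names write_doc_to_s3 and skip_upload are illustrative placeholders, not necessarily what the PR actually uses:

```python
import json
import zlib

import boto3
from monty.json import jsanitize


# Hypothetical sketch of the reworked store -- not the actual maggma code.
class S3Store:
    def __init__(self, index, bucket, key="task_id"):
        self.index = index
        self.key = key
        # Bucket resource is created once in the constructor instead of per call.
        self.s3_bucket = boto3.resource("s3").Bucket(bucket)

    def write_doc_to_s3(self, doc):
        # Compress and upload a single document; a builder can call this from
        # process_items to do the uploads in parallel worker processes.
        blob = zlib.compress(json.dumps(jsanitize(doc)).encode())
        self.s3_bucket.put_object(Key=str(doc[self.key]), Body=blob)

    def update(self, docs, skip_upload=False):
        # With skip_upload=True only the index metadata is written here,
        # because the blobs were already pushed by the workers.
        if not skip_upload:
            for doc in docs:
                self.write_doc_to_s3(doc)
        self.index.update(docs)
```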