Parallel uploads #91
base: master
Conversation
- switched to using archive_id as item key in the bookkeeping db.
- added updatedb command for people that are upgrading to change their db.
- updated readme to reflect changes.
- some minor clean-up of the files.
Works mostly (hash check currently fails and needs work).
It didn't work as it should...
If the flag --resume is given, glacier-cmd will check existing data in <out_file> and if it matches, continue the download from where it was.
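(For illustration only: a minimal sketch of such a resume check, assuming the already-downloaded prefix can be verified with a plain SHA-256 hash. The function name `resume_offset` and its arguments are invented for this example and are not the branch's actual code.)

```python
import hashlib
import os

def resume_offset(out_file, expected_prefix_hash=None):
    """Return the byte offset to resume a download from (0 means start over)."""
    if not os.path.exists(out_file):
        return 0
    size = os.path.getsize(out_file)
    if expected_prefix_hash is not None:
        # Verify the bytes already on disk before trusting the partial file.
        h = hashlib.sha256()
        with open(out_file, 'rb') as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b''):
                h.update(chunk)
        if h.hexdigest() != expected_prefix_hash:
            return 0  # existing data does not match; restart the download
    return size  # continue the download from this offset
```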
Conflicts: glacier/GlacierWrapper.py glacier/glaciercorecalls.py
Mmap has a 2 GB limit; now mapping the relevant part of the files (either the part to be checked or the part to be uploaded) instead of attempting to map the complete file in one go.
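(A minimal sketch of that part-mapping idea; the helper `map_part` is hypothetical, but it shows the one wrinkle involved: mmap offsets must be a multiple of the allocation granularity, so the requested start is rounded down and compensated for.)

```python
import mmap

def map_part(f, start, length):
    """Map only a slice of an open file instead of the whole file."""
    granularity = mmap.ALLOCATIONGRANULARITY
    aligned_start = (start // granularity) * granularity
    delta = start - aligned_start
    part = mmap.mmap(f.fileno(), length + delta,
                     offset=aligned_start, access=mmap.ACCESS_READ)
    return part, delta  # the requested bytes are part[delta:delta + length]
```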
Conflicts: glacier/GlacierWrapper.py
Conflicts: glacier/glacier.py
Added command line option --sessions to give the number of upload sessions to use. It is far from perfect: at the moment there are issues with upload processes dying (Amazon rejects our signature, as if the key is invalid?!), after which the whole thing just hangs and needs to be killed manually. This needs a workaround.
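(Roughly, such a --sessions option boils down to a pool of upload processes pulling parts from a shared queue. The sketch below only illustrates that shape; `upload_worker`, `run_upload_sessions` and the job tuples are made up for this example and are not the branch's actual code.)

```python
import multiprocessing

def upload_worker(work_queue):
    # Pull (part_number, start, size) jobs until a None sentinel arrives.
    while True:
        job = work_queue.get()
        if job is None:
            break
        # The real code would upload this part to Glacier here.

def run_upload_sessions(jobs, sessions):
    work_queue = multiprocessing.Queue()
    for job in jobs:
        work_queue.put(job)
    for _ in range(sessions):
        work_queue.put(None)  # one sentinel per worker process
    workers = [multiprocessing.Process(target=upload_worker, args=(work_queue,))
               for _ in range(sessions)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```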
Will we ever migrate the upload/download process to boto? What are the plans? They have parallel upload support too.
Interesting, I missed that part of Boto. I will look into it; maybe it works better than my solution (I always get response errors).
Updated docs
- switched to using archive_id as item key in the bookkeeping db.
- added updatedb command for people that are upgrading to change their db.
- updated readme to reflect changes.
- some minor clean-up of the files.
- Fix for updatedb. It didn't work as it should...
- Fix for Python <2.7
- mmapping file now in parts, instead of trying to mmap it completely.
- Fixed upload of large files. Mmap has a 2 GB limit; now mapping the relevant part of the files (either the part to be checked or the part to be uploaded) instead of attempting to map the complete file in one go.
- Hope to have improved reaction times on the connection (see issue uskudnik#71)
- Added resumption of downloads (untested).
- Download resumption stage 1. Works mostly (hash check currently fails and needs work).
- Implemented download resumption. If the flag --resume is given, glacier-cmd will check existing data in <out_file> and if it matches, continue the download from where it was.
- fix
- Implemented parallel upload sessions. Added command line --sessions to give the number of upload sessions to use. It is far from perfect: at the moment issues with upload processes dying (amazon rejects our signature, as if the key is invalid?!) and then the whole thing just hangs and needs to be killed manually. This needs a workaround.
- Updated documentation to include the new options and commands.
…acier-cmd-interface into parallel_uploads Conflicts: glacier/GlacierWrapper.py glacier/glacier.py
I have pulled this branch using the following code:
and then built it with:
However, it does not work:
Am I doing something wrong, or is there a bug somewhere?
There's a bug. Please uncomment line 100 in GlacierWrapper.py: MAX_PARTS = 10000
... and set it to 1000. Or change the variable name on the next line.
Whoops - only a 0 was supposed to go, not that S. My bad! Anyway, it seems that 10,000 parts should also work now?
@wvmarle: Yes, both the current code and the code with MAX_PARTS = 10000 work. I have tested both. This code should be merged into the main branch, as it fixes the bug I reported.
Now it checks whether processes are still alive, and waits until the last upload process exits. Then in case the queue is not empty, a single process is created to clear up the work queue.
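(In sketch form, and with invented names, that watchdog amounts to polling is_alive() on the upload processes and then draining whatever is left on the queue with one extra process.)

```python
import time
import multiprocessing
from queue import Empty  # Queue.Empty on Python 2

def drain_queue(work_queue):
    # Process leftover jobs until the queue runs dry.
    while True:
        try:
            job = work_queue.get(timeout=5)
        except Empty:
            break
        # The real code would upload the leftover part `job` here.

def wait_and_clean(workers, work_queue):
    # The alive check: wait until every upload process has exited.
    while any(w.is_alive() for w in workers):
        time.sleep(1)
    for w in workers:
        w.join()
    # If dead workers left jobs behind, clear them with a single process.
    if not work_queue.empty():
        cleaner = multiprocessing.Process(target=drain_queue, args=(work_queue,))
        cleaner.start()
        cleaner.join()
```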
A week without any fixes - I will presume this is stable and merge tomorrow, unless @wvmarle says otherwise or new bugs are discovered.
As stable as it gets, I think. The only issue I have is the continuous and mysterious “response error” replies from Amazon...
I get this issue with larger files:
At this point it just hangs, so I have to break (press CTRL+C) and that gives the following error:
When this happens, I just repeat the same command, just adding a ... Is this the same error you are referring to, or something you have not seen before?
Yes, that's the issue I'm referring to. Very irritating. Dozens if not hundreds of parts are uploaded and accepted fine, and then suddenly the signature is not accepted (and by my understanding, the signature is related to your login credentials - it's done by Boto, and I haven't dug so deep as to know how that is done exactly).
@wvmarle Any luck tracking the bug down? Do you know if it's a boto issue or Amazon's?
The problem here is different. First of all, the boto upload implementation ... so that's why I'm reimplementing the whole upload part, fixing some if not ...
@uskudnik: nothing done yet on this one. Just got myself a set of new toys (including an Epson wifi printer: took me 6 hours to get that installed!! Had to hunt down an unofficial ISO of the installation CD as I don't have CD-ROM drives anymore and the official downloads are broken...) and a new netbook :-) So my priorities are distracted :-) @offlinehacker: you mean you're re-implementing the boto upload routines? Wasn't that present in the original glaciercorecalls.py file already?
@wvmarle: I am reimplementing the whole upload and still deciding if I will ... Currently the most helpful part for me would be better formatting of ... And please start writing tests before you implement anything else, or we ...
CausedException is integrated into GlacierException, and the stack trace is dumped in the log file at DEBUG level. This is because users normally don't need to see it, while developers can still get at it this way. Any uncaught exceptions of course dump the stack trace to screen. Agreed that the upload must not have bugs; writing tests, on the other hand, is also not easy until we fully and thoroughly understand Amazon's responses (like this response error issue) well enough to be able to simulate errors.
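(A rough illustration of that split, with invented class and logger names rather than the actual GlacierException code: the full traceback goes to the log at DEBUG level, while the user-facing message stays short.)

```python
import logging
import traceback

logger = logging.getLogger('glacier')

class WrappedError(Exception):
    """Illustrative stand-in for an exception wrapper of this kind."""

    def __init__(self, message, cause=None):
        super(WrappedError, self).__init__(message)
        self.cause = cause
        # Dump the full stack trace to the log only, at DEBUG level
        # (meaningful when this is raised from inside an except block).
        logger.debug("Stack trace for %r:\n%s", message, traceback.format_exc())

    def __str__(self):
        # End users only see the short message, plus the cause if any.
        if self.cause is not None:
            return "%s (caused by: %s)" % (self.args[0], self.cause)
        return self.args[0]
```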
Thanks, but I was wondering because we have a lot of copy-pastes without full ...
Exceptions are done that way because I want to make them look a lot better for the end users, while still being able to get to the stack trace if really needed. We may consider having a constant defined, say ...
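(Something along these lines, purely as a sketch; the constant name below is hypothetical, not the one proposed in the thread.)

```python
import traceback

# Hypothetical debug toggle for developers; end users leave it off.
SHOW_TRACEBACK = False

def report_error(exc):
    """Show a short message by default, or the full trace when debugging.

    Meant to be called from within an except block, so print_exc()
    has a traceback to show.
    """
    if SHOW_TRACEBACK:
        traceback.print_exc()
    else:
        print("Error: %s" % exc)
```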
Yes, that flag would be cool, I support it ;) I've also committed an almost finished, but completely untested, new upload implementation, available on my GitHub (don't even try to run it, it won't start), but you can see the core ideas (function _upload in GlacierWrapper and class Part in corecalls). Completely the same code can upload using multiprocessing and without it, using ... I will hopefully finish it tomorrow (without tests, which will come in later days), also taking quite some code from this commit.
@wvmarle Hi, I tried to use your parallel-upload branch, but it seems to have some problem with files greater than 2 GB: part = mmap.mmap(fileno=f.fileno(), ... I guess this error is similar to #99.
Yes, same issue.
Parallel uploads; upload resumption; download resumption. All in one go - should be able to apply this to master without conflicts.