
OC_FileChunking on external file systems #4997

Closed
DeepDiver1975 opened this issue Sep 26, 2013 · 57 comments

@DeepDiver1975
Member

The current implementation of OC_FileChunking stores the chunks in the server's file cache (a hidden/private folder on the server).

After all chunks have been received the file is reassembled and moved to the final target location.

While this approach seems valid for local filesystems, it can cause issues with external filesystems and big files, because the final move of the assembled file takes considerable time.

Looking at the overall system behavior, it might be worthwhile to move the chunks to the external filesystem right away.

My suggestion would be to write the chunks directly into the target file, but under a different name: append .part and even make it hidden if possible.

This approach would also speed up the process on the local filesystem, as it eliminates the reassembly step, which today adds extra execution time.
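
To make that concrete, a minimal sketch of the naming scheme, assuming the backend allows dot-prefixed (hidden) files; paths are purely illustrative:

```php
<?php
// Illustrative only: derive a hidden part-file name next to the target.
$target   = '/files/user/video.mp4';
$partFile = dirname($target) . '/.' . basename($target) . '.part';

// Chunks are written straight into $partFile as they arrive; once the
// upload is complete, only a single atomic rename remains:
// rename($partFile, $target);
```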

@karlitschek @bartv2 @icewind1991 @danimo @dragotin
Please add your comments - THX

@dragotin
Contributor

👍 as already mentioned in bug #2947, which unfortunately was killed by a close-bugs-no-matter-what wave ;-)

@bartv2
Contributor

bartv2 commented Oct 4, 2013

This is a good idea; we only need to add an offset header to know where in the file to write the chunk. This way the order of chunks is not important.

@DeepDiver1975
Member Author

> This is a good idea; we only need to add an offset header to know where in the file to write the chunk. This way the order of chunks is not important.

We could use the HTTP Range header for this.
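
As a rough sketch of the server side (the header choice and the parsing are assumptions, not a finalized protocol):

```php
<?php
// Assumption: the client announces the chunk position via Content-Range,
// e.g. "bytes 10485760-20971519/52428800". Not a finalized protocol.
$offset = 0;
if (isset($_SERVER['HTTP_CONTENT_RANGE']) &&
    preg_match('#bytes (\d+)-(\d+)/(\d+|\*)#', $_SERVER['HTTP_CONTENT_RANGE'], $m)) {
    $offset = (int)$m[1]; // byte position at which to write this chunk
}
```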

@DeepDiver1975
Member Author

@karlitschek oc7 I guess 😉

@icewind1991
Contributor

Assembling the chunks on the external file system will be a problem, though: most backends don't support partial writes or appends, so the only reliable way to append two uploaded chunks on an external file system would be to download both chunks and upload the result.

@DeepDiver1975
Member Author

In such a scenario we need to stick with the approach we have today.

What about moving this to the storage API and letting the storage implementation handle it?
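
Something like the following hypothetical interface sketch - names are made up for illustration, not the actual storage API:

```php
<?php
// Hypothetical sketch: let each storage pick its own chunking strategy.
interface IChunkHandler {
    /** Store one chunk of an upload at the given byte offset. */
    public function writeChunk($target, $offset, $data);

    /** Called after the last chunk; moves the result into place. */
    public function finalize($target);
}
```

A local storage could implement writeChunk() with fseek(), while backends without partial writes keep the current cache-and-reassemble behavior.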

@karlitschek
Contributor

Yes. ownCloud 7

@icewind1991
Contributor

Moving it to the storage API might help, but I only know of one backend (FTP) that might be able to provide a better-than-default implementation, so I don't think it will help much in the end.

@DeepDiver1975
Member Author

It will help with iRODS as well - not the most common storage, but still ... 😉

@PVince81
Contributor

PVince81 commented May 2, 2014

Any more ideas on this? The thread raises some relevant issues about the proposed approaches, and I can't think of any other approach at the moment.

@PVince81
Contributor

PVince81 commented May 2, 2014

I heard that 1.6.0 has configurable chunk sizes. If we can increase the chunk size to match the max upload size, it might help to upload files to external storage as a single chunk, at least for bigger files.
This way sysadmins, if they wish, could manually raise the max upload size to match the kind of file sizes expected in their environment.
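
For reference, the effective server-side limit is the smaller of two php.ini values; a quick way to inspect them:

```php
<?php
// The smaller of these two php.ini values caps what a single-chunk
// upload can carry.
echo 'upload_max_filesize = ', ini_get('upload_max_filesize'), "\n";
echo 'post_max_size       = ', ini_get('post_max_size'), "\n";
```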

@dragotin
Contributor

dragotin commented May 5, 2014

@PVince81 remember that we also introduced chunking to avoid getting hit by the request timeout.

@DeepDiver1975
Member Author

> I heard that 1.6.0 has configurable chunk sizes.

We might want to update the protocol in this case - I can think of scenarios where the chunk size becomes necessary information on the server. As of today, the chunk size is not communicated.

@dragotin another topic for Wednesday?

@dragotin
Contributor

dragotin commented May 5, 2014

Yes, but I do not really see a big benefit in changing/configuring the chunk size.

@craigpg craigpg modified the milestones: ownCloud 8, ownCloud 7 Sep 2, 2014
@MTRichards
Contributor

This is part of a larger conversation to be had around external storage. For example, if you upload a file to the server in chunks, reassemble it, and then put it to a backend server, you are uploading twice: once to the ownCloud server, once to the backend.

This is part of a larger oC 8 concept we need to look at: allow direct access where possible, streaming if the backend supports it, and this approach only as a last resort, as performance is slowest here.

@dragotin
Contributor

This is not an enhancement but a bug: the mirall issue tracker is full of bug reports about timed-out uploads of big files (example: owncloud/client#2074). Please re-tag as bug.

Solution proposal is #12097

@DeepDiver1975
Member Author

@dragotin do you face these timeouts only on external storage or on local storage as well?

@dragotin
Contributor

Also on local storage. It's a matrix of computer speed, load, size of the file to process, and timeout value, so it will always be possible to time out.

@DeepDiver1975
Member Author

okay - for now we can at least get rid of the reassembly step ... I'll take care of that - NOW

@DeepDiver1975 DeepDiver1975 self-assigned this Nov 11, 2014
@PVince81
Contributor

@DeepDiver1975 if I understand correctly, you will store the part file directly on the external storage and use fseek() to fill in the chunk gaps?

Note that fseek() doesn't work on most external storages.

The other approach, simply putting the chunks onto the external storage server, will be even slower, because the workflow would be as follows:

  1. Upload chunk1 to SMB
  2. Upload chunk2 to SMB
  3. ...
  4. Upload chunk N to SMB
  5. Read chunk 1 ... N from SMB and at the same time write the part file to SMB
  6. Rename part file on SMB to the final file name.

Putting the chunks directly with this method adds extra overhead for each chunk, plus the overhead of having to re-download the chunks from SMB for assembly.

@DeepDiver1975
Member Author

> okay - for now we can at least get rid of the reassembly step ... I'll take care of that - NOW

The first step will be to not save the chunks individually on the server and reassemble them afterwards, but to write the chunks into a temp file; once complete, this file can be pushed to the final destination.

Regarding external storage this will not help much, because the reassembled file has to be pushed forward anyhow.

@PVince81
Contributor

Ok, so if I understand correctly, the following will happen:

  1. Server receives chunk 1, saves it at offset 0 in part file inside "data/$user/cache/file.part"
  2. Server receives chunk 2, saves it at offset 10 MB in part file inside "data/$user/cache/file.part"
  3. ...
  4. When all chunks received, upload "file.part" to the final location

When the chunks do not arrive in order, it will simply use fseek() on the part file to put the chunk at the right position (bittorrent style).

Is that correct? 😄
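
In code, the write step would look roughly like this - a minimal sketch assuming a seekable local part file, with illustrative names:

```php
<?php
// Minimal sketch: write one chunk at its offset into the part file.
function writeChunkToPart($partFile, $offset, $chunkData) {
    $fh = fopen($partFile, 'c'); // create if missing, never truncate
    if ($fh === false) {
        throw new \RuntimeException("Cannot open $partFile");
    }
    fseek($fh, $offset);  // jump to the chunk's position, even past EOF
    fwrite($fh, $chunkData);
    fclose($fh);
}
```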

@DeepDiver1975
Member Author

exactly

@DeepDiver1975
Member Author

@dragotin understood, and we will get into this discussion as well - but this also has to be done because we are shifting bytes around too many times.

@dragotin
Contributor

@DeepDiver1975 yes, I understood this now as well ;-) thanks...

@PVince81
Contributor

The interface defs look good. Makes sense!

@DeepDiver1975
Member Author

In order to bring this in we need to resolve the encryption issue as discussed in #12006 (comment).

@icewind1991 @schiesbn @PVince81 @karlitschek @craigpg we need to make a decision - otherwise any piece of code written to solve chunked upload will be pointless.

@PVince81
Contributor

PVince81 commented Dec 8, 2014

@icewind1991 proposed porting the file proxies to be called inside a storage wrapper. That would be a quick workaround that saves us from porting the whole encryption app to a storage wrapper in the short term (which we'll have to do eventually!). Here it is: #12701

Another possible alternative, if that doesn't work, is to make the storage's getChunkHandler work as follows: if the storage supports seeking (fseek) AND encryption is disabled, then upload chunks directly into the target storage (in the part file); otherwise fall back to the old mechanism. This also implies that the old mechanism is available as its own LocalCacheChunkHandler implementation.
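
The selection would then be a simple branch; a sketch where supportsSeeking() and DirectChunkHandler are hypothetical names (LocalCacheChunkHandler is the proposed wrapper for the old mechanism):

```php
<?php
// Sketch of the proposed handler selection; supportsSeeking() and
// DirectChunkHandler are hypothetical names, not existing API.
function getChunkHandler($storage, $encryptionEnabled) {
    if (!$encryptionEnabled && $storage->supportsSeeking()) {
        // write chunks straight into the target part file
        return new DirectChunkHandler($storage);
    }
    // old mechanism: collect chunks in the local cache, then reassemble
    return new LocalCacheChunkHandler($storage);
}
```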

We'll see how the experiments go.

@PVince81
Contributor

PVince81 commented Dec 8, 2014

The chunk handler experiment is here: #12160

@PVince81
Contributor

PVince81 commented Jan 7, 2015

Some additional ideas coming from #13157.
When the ext storage server is ownCloud (ownCloud backend or server-to-server sharing), we might not need part files at all. So it should also be possible for every storage to tell whether it needs part files or not.

I suspect that Dropbox might not need part files either, if their upload mechanism is atomic as well. I don't think they'll directly overwrite the current file like FTP or SMB would. This could save the part file step.
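
A hypothetical capability flag would be enough for that - sketch only, class and method names are illustrative:

```php
<?php
// Hypothetical sketch: a per-storage capability flag so atomic backends
// can skip the part-file step. Names are illustrative.
class OwnCloudBackendStorage {
    public function needsPartFile() {
        return false; // the remote ownCloud already uploads atomically
    }
}
```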

@DeepDiver1975 DeepDiver1975 modified the milestones: 8.1-next, ownCloud 8 Jan 9, 2015
@VincentvgNn

@PVince81
Help, I cannot upload and share large files of 20-50 MB!!
Please judge whether my problem fits this issue.

I have OC server 7.0.4 running on 5 GB of space at a webhost. The connection is very slow, with an upload speed of 20-60 kB/sec. When uploading these large files, I get a "Connection closed" and an "Operation cancelled" error message about every 6 minutes.
From an OC admin account I have 8 folders r/w shared with client 1 and client 2 (without encryption).
Client 1: Win XP and OC version 1-7.1
Client 2: Win 8.1 and OC version 1.8.0.4536-nightly20150108
Data size: 280 MB, 1600 files, 400 folders.
The large 20 MB+ files are uploaded at the end, when most of the data has already been shared. It doesn't matter much whether client 1 or client 2 starts uploading all this data. The indexing after a reconnect takes quite long (1-2 min.).
The server cache folder of the uploading client gets many "chunking" files (100 MB). In the attached picture you can see the size and the 6 min. interval of these files. The original file is 20,951 kB. All chunking files are just a little bit too small to complete the final file. When uploading from the Win XP side, the upload speed is lower and the chunks are smaller (about 2-8 MB).
[Attached screenshot: "150213 oc chunking files" - sizes and timestamps of the chunk files]
Note: The chunks are on the server, but as far as I remember, I did not have this problem when I was using client 1.6.4. At that time I could upload these large files (even including encryption).

@PVince81
Contributor

@VincentvgNn it mostly looks like the timeout makes it unable to finish the chunks? Are you using php-fpm, which auto-kills PHP processes when the network connection is broken? (mod_php doesn't kill the PHP process; it would continue working.)

@VincentvgNn

@PVince81
Thanks for your response.
Where can I find that php-fpm or mod_php setting? Within the OC server settings (as downloaded and set by the auto-installer) or outside?
I just tried phpinfo.php and found PHP version 5.5.21 and a lot more, but nothing about these settings. And anyway, I would not be able to change anything there.

Shouldn't a killed process be able to recover by using the available "chunking" files?
As it is now, the cached files seem to stay forever.

@PVince81
Contributor

@VincentvgNn it's not a setting, it's your setup. phpinfo might tell you.
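
A quick check without wading through the whole phpinfo output:

```php
<?php
// Prints the SAPI in use: "fpm-fcgi" means php-fpm, "apache2handler"
// means mod_php, "cgi-fcgi" means plain (Fast)CGI.
echo php_sapi_name();
```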

If the chunk files aren't complete, there is no way to recover them. The client should then resend the chunk, with a different transaction id (@dragotin can you confirm?). This would leave canceled chunks lying around, which is a known issue.

@VincentvgNn

@PVince81
php-fpm was introduced with PHP 5.3.3. So I bet that I have php-fpm and no longer mod_php. phpinfo mentions the authors for CGI / FastCGI and the FastCGI Process Manager, but nothing about php-fpm or mod_php.
I seem to experience auto-kill of PHP processes on a broken connection.

@VincentvgNn

@PVince81
On Feb. 13th you confirmed that chunks lying around is a known issue.
Today I tested server 8.0.2 and client 1.8.0. There was a moment where the file transfer ran very slowly, resulting in quite a few interruptions and producing many chunks. The chunks occupied more space than the good files!
Is anyone working on automatically removing such stale chunks after some time?

@PVince81
Contributor

The chunk cleanup will be a background job starting with 8.1: #14500
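
Roughly, such a job only has to scan the chunk cache and drop entries older than a cutoff; a minimal sketch (the name pattern and threshold are assumptions, not what #14500 actually implements):

```php
<?php
// Assumed layout: chunk files named "*-chunking-*" under the user's
// cache dir; anything older than a day is treated as abandoned.
function cleanupStaleChunks($cacheDir, $maxAge = 86400) {
    foreach (glob($cacheDir . '/*-chunking-*') as $chunk) {
        if (time() - filemtime($chunk) > $maxAge) {
            unlink($chunk); // reclaim space from the abandoned chunk
        }
    }
}

cleanupStaleChunks('/var/www/owncloud/data/user/cache');
```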

@PVince81
Contributor

We're past feature freeze => 9.0.

@PVince81 PVince81 modified the milestones: 9.0-next, 8.2-current Sep 21, 2015
@DeepDiver1975
Member Author

Closing this in favor of #20118.

@lock lock bot locked as resolved and limited conversation to collaborators Aug 7, 2019