Improving ext storage with streaming #5949
Comments
Here is an example of how to use streams with fopen():
@schiesbn the only fseek() call I know so far is the one to detect whether a file is encrypted. As we discussed before, this can be replaced with an approach that simply detects whether keys exist for that file. |
Additionally, we could experiment with parallel HTTP requests to reduce round-trip time.
Just saw that the webdav.php storage already uses curl, but it still performs a full download of the files. And the FTP storage does an fopen() directly on the ftp:// URL.
For OC7 we will try and introduce streaming to WebDAV mounts and SMB mounts. |
@PVince81 Doing WebDAV streaming should not be too hard to implement. How do you want to approach SMB/CIFS streaming?
Piping with smbclient. I've already experimented with it and it works, see #4770 (comment). But first we need to get @icewind1991's new SMB implementation in: #4770
We don't create a temp file of the whole file, just for the first 5MB. |
@georgehrke yes I know, that's why streaming ext storage will improve that as well. |
@PVince81 Ah okay, I was expecting a PHP implementation of SMB/CIFS, hehe. :-D |
Just had a look at WebDAV streaming. I also looked at curl and it seems that it's not possible to pipe files to curl, at least not using the "curl_exec" functions. |
@PVince81 How much work would it be to implement a stream wrapper that can be written to? |
Not sure, but it could be an idea. There are callback functions for read and write, so these could be used to buffer the next block in the wrapper. I've also found this: http://guzzle.readthedocs.org/en/latest/ (which happens to be used by the Amazon API lib) |
Is this still in the roadmap? In my case I'm now using a workaround. I substitute the code in
and I can now stream files directly from the external remote. This may break stuff elsewhere (for instance, any place that calls fseek on the returned stream), and it is not 100% functionally equivalent to the code I substituted (for instance, SSL validation may differ), but now I can finally use external storage for my use case (starting to consume > 1 GB files as quickly as possible while reading from a very slow external storage). I looked into a possible real solution to the problem, starting from the pointers that PVince81 added above. One way to properly fix this seems to be to create a new stream wrapper that saves the file to local storage (to allow seeking) and returns data as soon as it is available. Ingredients of this solution seem to be the function
I don't really understand how and why your code change would allow streaming. From what I heard there is no such thing as multi-threaded PHP, unless maybe if the called function is implemented in C/C++ in the back 😦 There are already a few stream wrappers used in the external storage backends that work almost like you said, but the temporary file is first fully downloaded before access to it is given. Another idea, for seeking, is that the caller can inform whether fseek is required or not. Needs further research. |
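The idea of letting the caller say whether fseek is required could be sketched roughly as follows. This is only an illustration, not code from ownCloud: the function names `ensureSeekable` and `openForCaller` are hypothetical, and the fallback simply buffers the whole stream into `php://temp`, which is the slow path this issue wants to avoid when seeking is not needed.

```php
<?php
/**
 * Return a handle that supports fseek().
 * Streams that are already seekable are returned untouched; anything else
 * is copied into php://temp first (assumption: the caller accepts that cost).
 */
function ensureSeekable($stream)
{
    $meta = stream_get_meta_data($stream);
    if (!empty($meta['seekable'])) {
        return $stream;
    }
    $buffer = fopen('php://temp', 'w+b');
    // copies until EOF of the source stream
    stream_copy_to_stream($stream, $buffer);
    fclose($stream);
    rewind($buffer);
    return $buffer;
}

/**
 * Hypothetical opener: the caller states up front whether it needs fseek().
 */
function openForCaller(string $url, bool $needsSeek)
{
    $stream = fopen($url, 'rb');
    if ($stream === false) {
        throw new RuntimeException("Unable to open stream: $url");
    }
    return $needsSeek ? ensureSeekable($stream) : $stream;
}
```

A caller that only streams sequentially would pass `false` and get the remote stream directly; only callers that really seek pay for the local copy.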
See #10620 |
Thanks for the feedback. As you stated, it turns out there is no multithreading in general in PHP. I studied the problem a bit more, and a mix of what you suggest (multi-curl) and what I was proposing can be used to implement something that works. Below is a small sample prototype of the approach I propose:
<?php
class SeekableHttpStream {
// keeps track of the number of bytes written to the temporary file
var $read_bytes = 0;
// context used to download the file
var $curl;
var $multi_curl;
// a file resource pointing to a temporary file where we store the download
var $temp_file;
// different than null as long as we are downloading from the remote server
var $active;
// position at which our client (the code that opened the stream with our
// protocol) wants the next read to happen
var $position = 0;
function write_to_temp($res, $data) {
// here we write the newly received data at the end of the temporary file
fseek($this->temp_file, $this->read_bytes);
$written = fwrite($this->temp_file, $data);
$this->read_bytes += $written;
return $written;
}
function stream_open($path, $mode, $options, &$opened_path) {
// open the temporary file that will store the partial download
$this->temp_file = fopen("php://temp", "w+b");
// setup the download
$this->curl = curl_init();
$real_url = "http".substr($path, strlen("http+curl")) ;
curl_setopt($this->curl, CURLOPT_URL, $real_url);
curl_setopt($this->curl, CURLOPT_HEADER, 0);
// set a callback that will be called when new data comes in
curl_setopt($this->curl, CURLOPT_WRITEFUNCTION, array($this, "write_to_temp"));
$this->multi_curl = curl_multi_init();
curl_multi_add_handle($this->multi_curl,$this->curl);
$this->active = null;
// start the download
do {
$mrc = curl_multi_exec($this->multi_curl, $this->active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
if ( $mrc != CURLM_OK ) {
// the transfer could not be started: clean up and report the failure to fopen()
$this->stream_close();
return false;
}
return true;
}
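function stream_close() {
// clean up when closing the download; guard against being called twice
// (wait_data() may already have closed the handles after an error)
$this->active = null;
if ($this->multi_curl === null) {
return;
}
curl_multi_remove_handle($this->multi_curl, $this->curl);
curl_multi_close($this->multi_curl);
curl_close($this->curl);
$this->multi_curl = null;
$this->curl = null;
}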
function stream_read($count) {
if ( $this->stream_eof()) {
return FALSE;
}
$available_count = min( $this->read_bytes - $this->position, $count);
fseek($this->temp_file, $this->position);
$data = fread($this->temp_file, $available_count);
$this->position += strlen($data);
return $data;
}
function stream_tell() {
return $this->position;
}
function wait_data() {
// wait for incoming data
if (curl_multi_select($this->multi_curl) != -1) {
// read the data
do {
$mrc = curl_multi_exec($this->multi_curl, $this->active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
if ( $mrc != CURLM_OK) {
$this->stream_close();
}
} else {
$this->stream_close();
}
}
function wait_for_end() {
// wait until the download is over
while ( $this->active) {
$this->wait_data();
}
}
function wait_for_position($position) {
// wait until we read past the position or the download is over
while ( $position >= $this->read_bytes && $this->active) {
$this->wait_data();
}
}
function stream_eof() {
// here we could rely on the Content-Length field of the header
$this->wait_for_position($this->position);
// return TRUE if the position is at or past the end of the downloaded data
return $this->position >= $this->read_bytes ;
}
function stream_seek($offset, $whence) {
switch ($whence) {
case SEEK_SET:
$this->wait_for_position($offset);
if ($offset >= 0 && $this->read_bytes >= $offset) {
$this->position = $offset;
return TRUE;
} else {
return FALSE;
}
break;
case SEEK_CUR:
return $this->stream_seek($this->position + $offset, SEEK_SET);
case SEEK_END:
// here we could rely on the Content-Length field of the header
$this->wait_for_end();
// SEEK_END offsets are relative to the end of the file, not the current position
return $this->stream_seek($this->read_bytes + $offset, SEEK_SET);
default:
return false;
}
}
}
stream_wrapper_register("http+curl", "SeekableHttpStream")
or die("Failed to register protocol");
$fp = fopen("http+curl://127.0.0.1/infinite-file.php", "r");
print "First read: ";
print fread($fp, 4);
print "<br>";
print "Second read: ";
print fread($fp, 4);
fseek($fp, 0);
print "<br>";
print "Third read: ";
print fread($fp, 4);
?>
The approach could be optimized significantly, for instance by initiating range requests when the file is seeked to a region that is not yet downloaded and by using the Content-Length header to predict the file size.
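The two optimizations mentioned here could start from small helpers like the ones below. This is a sketch only: `parseContentLength` and `rangeFrom` are hypothetical names that are not part of the prototype, and the wiring shown in the comments assumes an extra `$total_bytes` property that the class above does not have.

```php
<?php
/**
 * Extract the value of a Content-Length header line.
 * Returns null for any other header line.
 */
function parseContentLength(string $headerLine): ?int
{
    if (preg_match('/^Content-Length:\s*(\d+)\s*$/i', $headerLine, $m) === 1) {
        return (int) $m[1];
    }
    return null;
}

/**
 * Build a CURLOPT_RANGE value that requests everything from $offset
 * to the end of the file.
 */
function rangeFrom(int $offset): string
{
    return $offset . '-';
}

// Possible wiring inside stream_open() (assumption, untested against a server):
//
// curl_setopt($this->curl, CURLOPT_HEADERFUNCTION, function ($ch, $line) {
//     $length = parseContentLength($line);
//     if ($length !== null) {
//         $this->total_bytes = $length; // hypothetical property
//     }
//     return strlen($line); // cURL expects the number of bytes handled
// });
//
// and, for a second transfer started after a seek past the downloaded data:
//
// curl_setopt($curl, CURLOPT_RANGE, rangeFrom($offset));
```

With the size known up front, SEEK_END and stream_eof() would no longer have to wait for the whole download. Note that a server may omit Content-Length (for example with chunked transfer encoding) or ignore range requests, so both paths would still need the existing fallback.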
I prepared a prototype of the changes needed to fix this bug. It is only partially tested (i.e. I downloaded a file and it worked): https://github.com/lokeller/core/commit/fcb1edb3b590ba8bba7f6801ea7fdc79a2400008 I put the commit in the public domain; if you feel it may be useful in ownCloud, feel free to include it (I don't want to sign any contribution agreement, so I won't create a proper pull request). If there is interest, I can polish and test it a bit more.
@lokeller Hey. Signing a contributors agreement is not a requirement for submitting pull requests to ownCloud. After submitting a pull request, you will get the chance to alternatively state that your contribution is MIT licensed. :-)
That's great! I'll test my patch a bit further and then create the pull request so it will be easier to review.
See #11000 |
Looks like Guzzle could be used with a stream factory: https://github.com/owncloud/core/pull/19002/files#diff-2ab592dfd95e06115a358d1b5b20cdc5R318 |
Okay, looks like it might be even easier, as long as the remote supports HTTP: https://github.com/owncloud/core/pull/18653/files#diff-782207b41e0a420a99054e41c7e7946dR348 |
We might need to fix the encryption code first to not assume that |
Streaming upload would help reduce the occurrences of PHP timeouts like #24454 (comment) |
@mmattel FYI I added this to the planning discussion: #24684 (comment) |
This issue has been automatically closed. |
Currently the performance of external storage is bad because the file must first be downloaded into a temporary file; fopen() is then called on that file and the handle is returned. This means that downloading a 4 GB file from an external storage first creates a 4 GB temporary file, which is then passed to the client as a download.
In some cases, like downloading, mimetype scanning (if we keep it), the antivirus app, etc., we are only interested in either reading the first bytes of the stream (fread() then fclose()) or streaming the whole file sequentially. No fseek() is needed.
In such cases, it might be more efficient to do a fopen() on the stream directly, if possible. It seems that PHP allows fopen() on HTTP URLs. We could just stream the body of the response as it comes into the hooks and back to the client.
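As a rough illustration of the "first bytes only" case, a helper along these lines could read the head of anything fopen() can open without creating a full temporary copy. The name `peekStream` and the default length are made up for this sketch, and opening remote URLs this way assumes `allow_url_fopen` is enabled.

```php
<?php
/**
 * Read at most $length bytes from the start of a stream and close it again.
 * $url can be a local path or any URL supported by an fopen() wrapper.
 */
function peekStream(string $url, int $length = 8192): string
{
    $handle = @fopen($url, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Unable to open stream: $url");
    }
    try {
        $head = '';
        // fread() on network streams may return fewer bytes than requested,
        // so keep reading until we have enough or the stream ends
        while (strlen($head) < $length && !feof($handle)) {
            $chunk = fread($handle, $length - strlen($head));
            if ($chunk === false || $chunk === '') {
                break;
            }
            $head .= $chunk;
        }
        return $head;
    } finally {
        fclose($handle);
    }
}
```

For mimetype detection, something like `peekStream($url, 4096)` would transfer only the beginning of the response body before the connection is closed, instead of the whole file.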
I had a quick look and it looks like most external storages could be modified to use streaming, as many of them use HTTP calls anyway.
The alternative to this would be to use a library like CURL that uses threads to pre-download the file into the temporary file. The control could be given back to the caller before the file is finished downloading, so that they can already start working on the start of the temporary file.
Please let me know what you think of this idea.
@icewind1991 @karlitschek @schiesbn @DeepDiver1975 @bantu
List of storages
[WIP] Use stream wrapper that allows to seek HTTP files #11000
stream webdav downloads using http client #18653