
Data Exporting #25

Merged · 2 commits · Apr 19, 2016

Conversation

nginnever
Contributor

This is the beginning of data exporting. Looking for review and help on this one. cc @bgrieder @diasdavid @dignifiedquire

Currently this implementation passes an event emitter out and fires 'file' events to the listening component (consumed via ee.on('file')), each carrying an object with the stream that data is being pushed to and the path needed to write it to disk. There are also 'endDir', 'endNestDir', and 'endBigFile' events to signal that the exporter is done receiving data; the stream passed in the event signals the same thing when it pushes its last data.
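For concreteness, a minimal sketch of a consumer of these events (the exporter entry point and names here are illustrative, not this module's exact exports):

```js
const fs = require('fs')

const ee = exporter(hash, dagService) // hypothetical entry point
ee.on('file', ({ stream, path }) => {
  stream.pipe(fs.createWriteStream(path)) // data is pushed in as it arrives
})
ee.on('endDir', () => { /* flat directory fully exported */ })
ee.on('endNestDir', () => { /* nested directory fully exported */ })
ee.on('endBigFile', () => { /* large linked file fully exported */ })
```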

The tests currently pass and I am able to write files and nested dirs of files to disk for the js-ipfs get command. I was having an issue with some video files > 5 MB not writing correctly (all of the bytes show up but the file is corrupt, which makes it appear something was written out of order) when I wired this code into the js-ipfs core CLI get command. That setup currently shares a repo with go-ipfs and I suspect the error might lie there; when writing the same video files to disk from the test repo in this module, the files came out correctly.

Also, you'll notice I am skipping the web test for nested directory export. For some reason it hangs for me with no helpful errors. If others could turn that test on and see what it does, that would be great.

I will be making a PR to js-ipfs shortly with a get command that uses this exporter. I still need to make sure that the cat command can listen to these streamed events, but since cat does not deal with directories I don't see why it would be a problem.

Let me know what you guys think!

Thanks

```js
// export into files by hash
exports.export = function (hash, dagService, options, callback) {
  if (typeof options === 'function') { callback = options; options = {} }
```
Contributor


should be on multiple lines, if you need to use semicolons ;)
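i.e., something like this sketch, written out without semicolons:

```js
if (typeof options === 'function') {
  callback = options
  options = {}
}
```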

@dignifiedquire
Contributor

So, why do we need the end* events? Couldn't we have something like this (rough sketch after the list):

  • File (both with and without links)
    1. emit('file', {stream, path})
    2. When the stream emits 'end', the file is finished
  • Directory
    1. emit('directory', {stream, path})
    2. The stream is an object stream, with each object of the form
      • type: either 'file' or 'directory'
      • content: if 'file', a stream of the file content; if a directory, the same as this (an object stream)
    3. The stream emits 'end' when everything has been streamed
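A minimal sketch of what consuming that shape could look like (the exporter variable and the entry name field are illustrative assumptions; the proposal only specifies type and content):

```js
const fs = require('fs')
const path = require('path')

exporter.on('file', ({ stream, path: filePath }) => {
  stream.pipe(fs.createWriteStream(filePath)) // 'end' on stream => file finished
})

exporter.on('directory', ({ stream, path: dirPath }) => {
  fs.mkdirSync(dirPath)
  stream.on('data', (entry) => {
    // entry = { type: 'file' | 'directory', content: <stream>, name: ... }
    // (the name field is assumed here for illustration)
    if (entry.type === 'file') {
      entry.content.pipe(fs.createWriteStream(path.join(dirPath, entry.name)))
    } else {
      // entry.content is itself an object stream shaped like this one
    }
  })
  stream.on('end', () => { /* everything under dirPath has been streamed */ })
})
```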

@nginnever
Contributor Author

Also, one major thing I would like to mention is that currently, for large files with links, the event emitter does not fire until all of the data has been buffered into the stream array. I wonder if this could cause memory problems for really large files, and what a better solution would be for getting the data to stream out as it is pulled from each link rather than buffering it all through the ee.

@daviddias
Contributor

> File (both with and without links)
>   1. emit('file', {stream, path})
>   2. When the stream emits 'end', the file is finished

Agree with this part of @dignifiedquire's proposal.

> Directory
>   1. emit('directory', {stream, path})
>   2. The stream is an object stream, with each object of the form
>     • type: either 'file' or 'directory'
>     • content: if 'file', a stream of the file content; if a directory, the same as this (an object stream)
>   3. The stream emits 'end' when everything has been streamed

We don't need to emit directories; for what it's worth, a path is a key, so the file should be emitted like

.emit('file', {path: 'path to file', readable: <stream>})

so that these streams can be piped through a multipart stream back to the CLI, then handled by the CLI and written to disk, or handled by a browser app directly.
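In consumer terms, that files-only shape would look something like this (exporter is a hypothetical stand-in):

```js
const fs = require('fs')

exporter.on('file', ({ path, readable }) => {
  // the path doubles as the key; directories are implied by its segments
  readable.pipe(fs.createWriteStream(path))
  // or pipe readable into a multipart stream headed back to the CLI
})
```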

> Also, one major thing I would like to mention is that currently, for large files with links, the event emitter does not fire until all of the data has been buffered into the stream array

Why does this have to happen? As soon as you know a file exists, you should emit the file event and then just write to the stream that was returned, so that chunks can be consumed as soon as possible (otherwise it defeats the purpose of having streams).
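A rough sketch of that emit-early approach, assuming a PassThrough and stand-in dagService.get/unmarshal helpers (none of these names are this module's actual API):

```js
const async = require('async')
const { PassThrough } = require('stream')

function exportFile (node, filePath, dagService, ee) {
  const stream = new PassThrough()
  ee.emit('file', { stream, path: filePath }) // emit as soon as the file is known

  async.eachSeries(node.links, (link, cb) => {
    dagService.get(link.hash, (err, child) => { // stand-in fetch of a linked block
      if (err) return cb(err)
      stream.write(unmarshal(child.data))       // stand-in unmarshal helper
      cb()
    })
  }, (err) => {
    if (err) return stream.emit('error', err)
    stream.end() // consumers only see 'end' after the last link is written
  })
}
```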

@nginnever
Contributor Author

> Also, one major thing I would like to mention is that currently, for large files with links, the event emitter does not fire until all of the data has been buffered into the stream array

> Why does this have to happen? As soon as you know a file exists, you should emit the file event and then just write to the stream that was returned, so that chunks can be consumed as soon as possible (otherwise it defeats the purpose of having streams)

I am having trouble getting the timing of the stream events and the async.series calls right: I get either a `not implemented` error or a `stream.push() after EOF` error. Switching to Highland to simplify all of this looks like a good idea, so I'm looking into it.
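For reference, the `stream.push() after EOF` failure mode looks roughly like this contrived sketch: an async.series step that ends the stream before a later step pushes more data triggers exactly this error.

```js
const { Readable } = require('stream')

const stream = new Readable({ read () {} })
stream.push(Buffer.from('early chunk')) // fine
stream.push(null)                       // signals EOF
stream.push(Buffer.from('late chunk'))  // emits Error: stream.push() after EOF
```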

@daviddias
Contributor

Will Highland mean that ipfs-core has to learn how to consume Highland streams? It is preferable to keep pure Node.js streams for the files interface.

@dignifiedquire
Contributor

Highland can export pure Node streams.
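For example (a minimal sketch): Highland streams pipe straight into plain Node writables, so nothing Highland-specific needs to leave the module.

```js
const _ = require('highland')
const fs = require('fs')
const { PassThrough } = require('stream')

// internal logic can use Highland...
const hl = _(['hello ', 'highland\n']).map((s) => s.toUpperCase())

// ...while the module's boundary stays a pure Node stream
const nodeStream = new PassThrough()
hl.pipe(nodeStream)
nodeStream.pipe(fs.createWriteStream('/tmp/out.txt'))
```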

@nginnever
Contributor Author

nginnever commented Apr 15, 2016

I think the reason I am having trouble with files with links is that when I pass the stream in the event emitter, if the consumer gets the stream before all of the node.links data buffers have been pushed onto the stream's array, the consumer will drain the array of buffers before the stream has had a chance to push the next node.link data buffer onto it. This could be why I'm getting the `not implemented` error even when I know I have passed the stream to the consumer with some initial data in it: fetching the data from the repo and unmarshalling takes longer than consuming the data itself.

Is there a mechanism in JS to open an fs.createWriteStream that waits for data in the middle of a write?
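For what it's worth, a write stream already waits whenever its source goes quiet; a PassThrough in front of it makes that concrete (a sketch with placeholder chunks):

```js
const fs = require('fs')
const { PassThrough } = require('stream')

const pass = new PassThrough()
pass.pipe(fs.createWriteStream('/tmp/partial-file')) // waits while pass is quiet

pass.write(Buffer.from('first chunk'))   // write what is available now
setTimeout(() => {
  pass.write(Buffer.from('later chunk')) // more data mid-write is fine
  pass.end()                             // only end() closes the file
}, 1000)
```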

@dignifiedquire
Contributor

@nginnever let's have a call about this, I think that will be easier.

@nginnever
Contributor Author

@dignifiedquire @diasdavid

I made some edits based on our talk yesterday and the comments from dig, which broke the way I was doing tests, so I will need to rewrite those today when I get a chance.

I did not break the file export and dir export functions into separate files. I was thinking we could just get something working for now so I can move on to building the files API core function in js-ipfs that will use this and begin testing.

If you guys get a chance to look at this, let me know if you have more suggestions. Thanks as always!

@dignifiedquire
Contributor

@nginnever as promised, a small example of the simplicity with Highland:
https://gist.github.com/dignifiedquire/2d81b85074f153c0f2c28694bdb7eed9

@daviddias daviddias merged commit 3751be4 into ipfs-inactive:master Apr 19, 2016