Provide support for specifying an IPFS hash for a CDXJ file for the replay system #80

Closed
machawk1 opened this issue Jan 9, 2017 · 11 comments


@machawk1
Member

machawk1 commented Jan 9, 2017

Current support is limited to a CDXJ file at a locally accessible path. Related to #61.

@ibnesayeed
Member

ibnesayeed commented Jan 9, 2017

What we currently support is essentially the file:// protocol, limited to locally available/accessible drives and written without the protocol prefix. However, it can be generalized to read the index file over http://, https://, ftp://, or ipfs://. The biggest issue would be the lack of an efficient seek mechanism for binary search. HTTP does allow requesting a byte range if the server supports it, but that means many successive HTTP requests (partial data fetches) to locate a position in the index file. An alternative approach would be to cache the index file locally and perform the binary search on that copy, but the cache expiration has to be chosen carefully to balance too many fetches against a stale index.
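To make the cost of the byte-range approach concrete, here is a minimal sketch of a binary search over a sorted, newline-delimited index that only ever reads small byte ranges. It is an illustration, not ipwb or pywb code: `find_line`, `read_range`, and the `chunk` size are hypothetical names, and the index is assumed to be sorted bytewise with one record per line.

```python
from typing import Callable, Optional, Tuple

# Hypothetical reader: returns the bytes in the half-open range [start, end).
RangeReader = Callable[[int, int], bytes]


def _first_line_at_or_after(read_range: RangeReader, size: int, pos: int,
                            chunk: int) -> Tuple[int, Optional[bytes]]:
    """Return (offset, line) for the first line that starts at or after pos."""
    base = max(pos - 1, 0)   # file offset of buf[0]; step back to catch a "\n" at pos-1
    buf = b""
    skip = pos > 0           # True until the newline that precedes the line is found
    begin = 0                # index in buf where the wanted line starts
    while True:
        if skip:
            nl = buf.find(b"\n")
            if nl != -1:
                begin, skip = nl + 1, False
        if not skip:
            end = buf.find(b"\n", begin)
            if end != -1:
                return base + begin, buf[begin:end]
        nxt = base + len(buf)
        if nxt >= size:
            break
        data = read_range(nxt, min(nxt + chunk, size))
        if not data:         # defensive: a reader that returns nothing ends the scan
            break
        buf += data
    if skip or base + begin >= size:
        return size, None    # no line starts at or after pos
    return base + begin, buf[begin:]   # last line without a trailing newline


def find_line(read_range: RangeReader, size: int, key: bytes,
              chunk: int = 256) -> Optional[bytes]:
    """Return the first line starting with key, or None if there is none."""
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        _, line = _first_line_at_or_after(read_range, size, mid, chunk)
        if line is None or line >= key:
            hi = mid          # the answer starts at or before this probe
        else:
            lo = mid + 1      # the probed line sorts before key
    _, line = _first_line_at_or_after(read_range, size, lo, chunk)
    return line if line is not None and line.startswith(key) else None
```

Over HTTP, `read_range(start, end)` would map to one request with a `Range: bytes=start-(end-1)` header (HTTP ranges are inclusive), so each probe costs at least one round trip and a lookup needs on the order of log2(size) probes. Against a locally cached copy, the same function can simply slice the file contents.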

@machawk1
Member Author

machawk1 commented Jan 9, 2017

@ibnesayeed This ticket is about fetching the CDXJ index file via IPFS, not HTTP. Parsing and search efficiency may be a separate issue.

@ibnesayeed
Member

That is what my previous comment covered, along with some related matters.

@machawk1
Member Author

machawk1 commented Jan 9, 2017

@ibnesayeed We ought to have a larger sample data set to test this; for example, a 500(+?)-line CDXJ with the associated ipwb hashes inline as appropriate. Once this CDXJ is in IPFS, we can use the sample data to benchmark both the selective fetch and pywb's binary search over the fetched data. There is a slew of GH tickets that could be spawned from this. ;)

@ibnesayeed
Member

Nothing stops us from storing the index in IPFS or elsewhere, but I am against storing an index in IPFS if it will be changing frequently.

@machawk1
Member Author

machawk1 commented Jan 9, 2017

@ibnesayeed If working on a static corpus for, say, research, the index may not be changing frequently.

The crux of this ticket is that the user should not have to provide a (CDXJ) index file to run the software, but should be able to specify a hash instead, potentially one shared by another user.
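As a rough sketch of what "specify a hash" could look like, the index bytes might be pulled through the local IPFS daemon's gateway. Everything here is an assumption for illustration: `gateway_url` and `fetch_index` are hypothetical helpers, not ipwb functions, and the gateway address is the usual local default, which may differ per installation.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: a local IPFS daemon serves its gateway at the usual default address.
DEFAULT_GATEWAY = "http://127.0.0.1:8080"


def gateway_url(ipfs_hash: str, gateway: str = DEFAULT_GATEWAY) -> str:
    """Build the gateway URL for a hash (hypothetical helper)."""
    return "{}/ipfs/{}".format(gateway.rstrip("/"), quote(ipfs_hash, safe=""))


def fetch_index(ipfs_hash: str, gateway: str = DEFAULT_GATEWAY,
                timeout: float = 30.0) -> bytes:
    """Fetch the whole CDXJ index as bytes; requires a running daemon."""
    with urlopen(gateway_url(ipfs_hash, gateway), timeout=timeout) as resp:
        return resp.read()
```

Because content in IPFS is addressed by its hash, an index fetched this way cannot change underneath the user, which fits the static-corpus case mentioned above.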

@ibnesayeed
Member

Adding a special flag just to tell the server to treat the passed value as an IPFS hash and retrieve the data from there would embed too many special cases in the application. Using the protocol prefix would be a more general and more widely understood approach.

@machawk1
Member Author

machawk1 commented Jan 9, 2017

@ibnesayeed I would prefer smart defaults. If what "looks like" an IPFS hash is passed, treat it as such and process it accordingly. I currently have a very basic case of this when reading absolute/relative CDXJ file paths for the replay system. Admittedly, guessing like this also invites some potentially dangerous scenarios.
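A minimal sketch of such a "looks like a hash" check, assuming CIDv0-style hashes (46 base58 characters beginning with "Qm"). The names `looks_like_ipfs_hash` and `classify` are hypothetical, and the "ambiguous" outcome illustrates one of the risky cases: a local file whose name happens to be a valid-looking hash.

```python
import os
import re
from typing import Callable

# CIDv0 shape: "Qm" followed by 44 base58btc characters (no 0, O, I, or l).
_CIDV0 = re.compile(r"Qm[1-9A-HJ-NP-Za-km-z]{44}")


def looks_like_ipfs_hash(value: str) -> bool:
    """Heuristic only: checks the shape of the string, not that the hash exists."""
    return _CIDV0.fullmatch(value) is not None


def classify(value: str,
             exists: Callable[[str], bool] = os.path.exists) -> str:
    """Return "ipfs", "file", or "ambiguous" for a command-line argument."""
    hash_like = looks_like_ipfs_hash(value)
    if hash_like and exists(value):
        return "ambiguous"   # a local file is named like a hash; do not guess
    return "ipfs" if hash_like else "file"
```

What to do in the ambiguous case (fail, warn, or prefer one interpretation) is a design choice this sketch deliberately leaves open.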

That said, allowing special flags to force the type of interpretation would be good. Maybe that should be the initial approach but there is something elegant about specifying ipwb replay myIndex.cdxj over ipwb replay -f myIndex.cdxj or ipwb replay -i <ipfs hash>.

@ibnesayeed
Member

Even automatic detection of the hash signature embeds special-case knowledge in the application.

ipwb replay ipfs://<ipfs_hash>

is not any more complex or difficult than

ipwb replay <ipfs_hash>

The former is more expressive and uniformly suggests that, if implemented, any type of URL could be supported.

@machawk1
Member Author

machawk1 commented Jan 9, 2017

Do you think ipwb replay /path/to/my/local/cdxj should be allowed or should the last argument require the explicit file:// scheme? The latter, while more expressive, seems verbose.

@ibnesayeed
Member

For file://, I would make the prefix optional but supported either way, since local file paths are naturally understood. However, if the file is on a remote machine (and not available over HTTP), a protocol prefix such as file:// or smb:// will be required.
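The prefix-based approach could be sketched roughly as follows, with the scheme deciding how the index is read and a bare path treated as a local file. `parse_index_location` and the set of accepted schemes are assumptions for illustration, not ipwb's actual behavior.

```python
from typing import Tuple
from urllib.parse import urlsplit

# Hypothetical set of remote schemes a replay command might accept.
REMOTE_SCHEMES = {"http", "https", "ftp"}


def parse_index_location(arg: str) -> Tuple[str, str]:
    """Map a command-line argument to a (kind, location) pair."""
    parts = urlsplit(arg)
    scheme = parts.scheme          # urlsplit already lowercases the scheme
    if scheme == "":
        return ("file", arg)       # bare path: the file:// prefix is optional
    if scheme == "file":
        return ("file", parts.path)    # host part, if any, is ignored in this sketch
    if scheme == "ipfs":
        return ("ipfs", parts.netloc)  # ipfs://<ipfs_hash>
    if scheme in REMOTE_SCHEMES:
        return (scheme, arg)
    # Caveat: a Windows path such as C:\data\index.cdxj parses with scheme "c"
    # and ends up here; a real implementation would need to handle that.
    raise ValueError("unsupported index location: {!r}".format(arg))
```

With this shape, `ipwb replay /path/to/my/local/cdxj` and `ipwb replay file:///path/to/my/local/cdxj` would resolve to the same local path, while `ipwb replay ipfs://<ipfs_hash>` selects the IPFS reader without any hash sniffing.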


2 participants