This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Data Package Bundling (and maybe compression) #132
Comments
I was thinking of a specification that would describe how to interpret a zipped package on the fly, in the same way a JAR is executed by Java.
|
@sabas i think this makes a lot of sense. Do you want to start speccing something out? |
See #198 |
There was a lot of discussion in the PR. The PR basically suggested tar + gzip. Subsequent discussion in the PR suggested reviewing existing best practice more and using zip. Main excerpts: @mfenner wrote:
Excerpt from Research Object bundle spec:
@tfmorris wrote:
|
@mfenner would you be interested in taking a bit of editorship here? You were a strong proponent of introducing this (and I'm +1 too). In addition, this should be very simple and short spec to write once we decide what to do. |
Let me think about how to approach this. |
@mfenner any further thoughts? /cc @danfowler I am increasingly thinking that "bundling" a data package into one file (compressed) is an important use case and would love your suggestions here. |
@rgrp sorry for not following up on this. I want a standard zip compression, and hadn't found the time to spec out the details. Bundling a data package into one file is an important use case for me. |
For reference (although not directly related to a spec for compression) we went ahead and added zip support to the recently upgraded Python lib for DataPackage, based on very clear use cases in the CKAN integration, and, in general, that it is sensible and reasonable :). @vitorbaptista developed and led on that initiative. For reference: |
@mfenner i imagine this can be super simple. Would you be able to start a draft and drop it in an issue here? @vitorbaptista useful to get outline of what you did. |
The requirements for my ZIP file loading were to be able to load both ZIPs that follow the pattern:
and also
This is because we wanted to support the ZIP files generated by GitHub (e.g. https://github.com/datasets/gdp/archive/master.zip), which have all contents inside a folder. The actual code checks that the ZIP file has one and only one |
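As a rough illustration of the two layouts described above (descriptor at the archive root, or everything inside a single top-level folder as in GitHub's archives), here is a minimal sketch. It is not the code from the Python library; `find_descriptor` is a hypothetical helper, and the rule "exactly one `datapackage.json`, at most one folder deep" is an assumption made for this example.

```python
import zipfile
from pathlib import PurePosixPath

DESCRIPTOR = "datapackage.json"


def find_descriptor(zf: zipfile.ZipFile) -> str:
    """Return the archive path of the descriptor (hypothetical helper).

    Accepts a descriptor at the archive root or inside one top-level
    folder; anything else is rejected. This rule is an assumption.
    """
    candidates = [
        name
        for name in zf.namelist()
        if PurePosixPath(name).name == DESCRIPTOR
        and len(PurePosixPath(name).parts) <= 2  # root, or one folder deep
    ]
    if len(candidates) != 1:
        raise ValueError(
            f"expected exactly one {DESCRIPTOR}, found {len(candidates)}"
        )
    return candidates[0]
```

The directory part of the returned path (empty, or the single folder name) can then serve as the base path for resolving resource paths inside the archive.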
+1 Makes a lot of sense. |
I just hope you are aware of the ZIP filename encoding problems: http://marcosc.com/2008/12/zip-files-and-encoding-i-hate-you/ A lot of users still stick to windows-1251 (Cyrillic) or SHIFT_JIS (Japanese). Maybe it would be a good idea to pick an archive format that doesn't have such a design flaw (if such a format exists)? |
That blog post is from 2008, is barely coherent, and seems focused more on
the tools than the format.
What do you recommend instead of ZIP?
|
@mfenner are you happy to draft a mini-spec here? I imagine it could be just a few paragraphs saying e.g.
|
I wouldn't limit it to
I would suggest we follow the 3rd option, as it's both easier to code and to explain. |
I think it is better to be explicit in this case and limit the options for people. A single |
Option 1 would enforce the rules used by the |
@tfmorris I propose 7zip, as it is open source, provides a better compression ratio, and stores file names as UTF-8. Although 2008 is a long time ago, the i18n problems with file names are the same: a ZIP file created on a PC with a Korean locale and containing Korean file names will unpack as unreadable gibberish on a PC with a different locale. |
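For context on the encoding concern: the ZIP format has a general-purpose flag (bit 11, `0x800`) that marks an entry's file name as UTF-8, and Python's `zipfile` decodes names without that flag as CP437. A minimal sketch of how a reader could detect risky entries follows; `names_without_utf8_flag` is a hypothetical helper, not part of any Data Package library.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: file name is stored as UTF-8


def names_without_utf8_flag(zf: zipfile.ZipFile) -> list:
    """List entries with non-ASCII names that lack the UTF-8 flag.

    Such names were written in some legacy code page that the archive
    does not record, so they may decode to gibberish on another machine.
    """
    return [
        info.filename
        for info in zf.infolist()
        if not info.flag_bits & UTF8_FLAG and not info.filename.isascii()
    ]
```

A bundling pattern could, for example, require that this list be empty, which sidesteps the locale problem without changing archive format.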
For reference, BagIt's serialization specification work doesn't actually mandate a given format, just rules for (de)serializing behavior:
|
@mfenner are you still interested to work on a mini spec for this? |
Having read the BagIt approach I think they got it pretty much right. My only question would be about step 3: we could instead have you do it in the datapackage directory, so that the datapackage.json is at the root of the archive file. However, my guess is that the BagIt creators thought about this. Next steps:
|
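To make the "datapackage.json at the root of the archive" variant concrete, here is a minimal sketch using ZIP. It is only an illustration of that one option, not an agreed pattern; `bundle` and its parameters are hypothetical names, and the choice of ZIP with deflate is an assumption.

```python
import zipfile
from pathlib import Path


def bundle(package_dir, archive_path):
    """Zip package_dir so its datapackage.json sits at the archive root."""
    root = Path(package_dir)
    target = Path(archive_path)
    if not (root / "datapackage.json").is_file():
        raise FileNotFoundError(f"no datapackage.json in {root}")
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            # Skip directories, and the archive itself if it lives in root.
            if path.is_file() and path.resolve() != target.resolve():
                # Store names relative to the package directory.
                zf.write(path, path.relative_to(root).as_posix())
    return target
```

The BagIt-style alternative would store every entry under a single top-level directory instead, which amounts to prefixing each stored name with that directory's name.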
@rufuspollock will you work on some wording for this? Maybe better in here until I finish on #337 |
@pwalsh yes - note this is a patterns item at this stage. It won't be part of the spec atm i think. |
tar + zstd is great for this purpose. Zstd is superior to gzip/zlib. Tools exist and are available under a permissive license (BSD).
related topic #290 (comment) |
Updated: 2016-11-17
We want a way to "bundle" a data package into a single file for transmission. In addition it may be compressed at the same time.
Note also that individual resources can be compressed in themselves - see #290
Desired Features
Original Description
As other packaging types use compression for distributing each package (JAR is a ZIP archive), there should be a section proposing a way to deal with compressed data packages.