WACZ Aggregation / Multi WACZ Specification #112
Comments
I wonder if this should be a separate specification, to avoid confusion. It is no longer a WACZ, which is a specific format, but a collection of WACZ files. To start, it would probably be just a .json file (the datapackage.json), which is a different specification, though of course it depends on WACZ. The collection spec is sort of independent from the file format spec imo. |
I don't think it should be separate. Unless I am misunderstanding (which is very possible) an understanding of a WACZ Aggregation without an understanding of WACZ would be pretty much useless. Why make it more difficult to manage as two separate specifications? |
Well, it's really a completely different format, an aggregate format of wacz (and possibly other types) to form a collection. Maybe it should really be called a 'web archive collection', which consists of collection-level data. For example, maybe a collection could have a mix of wacz and regular warc files, which could both be listed in the resources section. As conceived now, it's just a single json file, though that could also change. |
Another aspect that the 'WACZ aggregate' could have is a page-to-wacz mapping, along with resources, for example:
This would help route the page to the correct wacz file; otherwise, a client would need to search through all of them. |
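As a rough illustration of the page-to-wacz mapping idea, here is a minimal Python sketch. The manifest layout and the field names (`resources`, `pages`, `name`, `wacz`) are assumptions made up for this example and are not taken from any spec; the example.com and example.org URLs are placeholders.

```python
from typing import Optional

# Hypothetical aggregate manifest. The field names ("resources", "pages",
# "name", "wacz") are assumptions for illustration only.
aggregate = {
    "resources": [
        {"name": "crawl-1", "url": "https://example.com/archives/crawl-1.wacz"},
        {"name": "crawl-2", "url": "https://example.com/archives/crawl-2.wacz"},
    ],
    "pages": [
        {"url": "https://example.org/", "wacz": "crawl-1"},
        {"url": "https://example.org/about", "wacz": "crawl-2"},
    ],
}


def build_page_index(manifest: dict) -> dict:
    """Map each page URL to the URL of the WACZ resource that holds it."""
    resource_urls = {r["name"]: r["url"] for r in manifest["resources"]}
    return {p["url"]: resource_urls[p["wacz"]] for p in manifest["pages"]}


def route_page(index: dict, page_url: str) -> Optional[str]:
    """Return the WACZ URL for a page, or None if the page is not mapped."""
    return index.get(page_url)


index = build_page_index(aggregate)
print(route_page(index, "https://example.org/about"))
```

With an index like this, a viewer could open only the one WACZ that contains the requested page instead of probing each file in turn.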
I like WACZ Aggregation better than WACZ Collection, and have just updated some of the text above. |
I guess my point is that if someone is building a WACZ client, as we want people to do, they will want to be clear about what their viewer needs to do. I think the easiest way to communicate that is with a single specification about what WACZ support means. If implementors need to digest multiple specifications I think we will risk losing them. |
Hi. Maybe something like a iiif collection manifest (so just a json or
json-ld) might make sense here?
On Wed, Mar 9, 2022 at 12:19 AM Ilya Kreymer ***@***.***> wrote:
I don't think it should be separate. Unless I am misunderstanding (which
is very possible) an understanding of a WACZ Collection without an
understanding of WACZ would be pretty much useless. Why make it more
difficult to manage as two separate specifications?
Well, it's really a completely different format, an aggregate format of
wacz (and possibly other types) to form a collection.
The only overlap is that currently thinking of this also as a frictionless
data package (though maybe that doesn't make sense given that it uses a
url instead of an internal path).
Maybe it should really be called a 'web archive collection', which
consists of collection-level data.
For example, maybe a collection could have a mix of wacz and regular warc
files, which could both be listed in the resources section. As conceived
now, it's just a single json file, though could also change.
--
Diego Pino Navarro
Digital Repositories Developer
Metropolitan New York Library Council (METRO)
|
Hm, that's an interesting idea. Yeah, it's basically a collection manifest for multiple WACZs (and possibly WARCs); maybe that's the best way to look at it. |
That's a useful comparison @DiegoPino! It might be helpful to think about the WACZ specification as the equivalent of the Image API and this Aggregated WACZ view as the equivalent of the Presentation API. I could even imagine expressing the Aggregated WACZ as a IIIF Manifest, since the IIIF Presentation API is oriented around the abstract idea of Canvases that can include images, video, and audio. Why not a web archive too? But I fear that this might be a bit of a leaky abstraction, because the sequencing of WACZs doesn't make a lot of sense in this context. Also, the WACZ itself contains a lot of presentation metadata. While the separation between the Image and Presentation APIs in IIIF allows it to be more general, I also think it came about because they were developed separately in time. As someone who had to implement support for them at one point, I can say that understanding and tracking them as separate specifications sometimes proved challenging. But I suspect other people may feel differently about that. I think what you've identified here is that there's a bit of slippage between the WACZ media type (the ZIP file on the web) and WACZ as an API, which we see developing in tools like Browsertrix Cloud and which relates to other work like WASAPI. I also think this crops up in the Unzipped WACZ use case... |
I noticed an inconsistency in the Frictionless DataPackage specification around URLs in Resources. The Data Resource specification says:
whereas another Data Package specification distinguishes between |
Yes, that may be a good comparison. Thinking about it more, I think aggregation should really be a separate spec, but in this repo for now, because:
I think it'll be really confusing to combine the more experimental aggregation format into the core format specification, which is already in use. I think aggregations should be a separate file, similar to use-cases, and probably have their own version for now. I imagine it'll need a bit more iteration before it's put into use.
|
Not to expand the immediate scope, but maybe this really needs its own repo, wacz-aggregation-spec.
|
If aggregation isn't core to using WACZ then I agree it should be a separate specification, or perhaps just be in the description of how a specific tool works, and not a standard at all. The purpose of positioning WACZ as a standard is to help others who are creating tools that use WACZ (crawlers, viewers, indexers, aggregators, etc). The specification should document the minimal requirements that allow developers to do that work, and also provide flexibility for them to add things that they need. If it is to be a standard it's important that the WACZ specification not turn into a detailed description of how one specialized implementation works. In order to encourage adoption it's also important not to make understanding WACZ too difficult with multiple interlocking specifications. Having one, concise description that covers the core use case of viewing a WACZ would help address these concerns. I just wanted to go on record with that recommendation. I personally don't think a changing WACZ specification is a concern at this stage since it isn't a standard yet, and we actually want it to change! |
After some conversation @ikreymer decided that Multi-WACZ, or WACZ Aggregation, is best served by a separate specification. This issue can stay open until that new repo & specification exist. |
Wanted to come back to this issue, now that we have all the specs in this issue, and list a new use case: multiple WACZ files grouped together for web-replay-gen, to be viewed as part of a static site but not necessarily merged together. I think we really need a JSON schema / format that covers several use cases of grouping and declaring WACZ files and collections. We now have at least the following use cases:
The schema should probably support collection-level metadata (title, description, etc.) as well as an optional list of pages. It may also be useful to be able to declare a list of WACZ files via some path prefix, e.g.
which might tell the tool reading the file to get all the WACZ files in the path prefix. (Will probably want to support different URL schemes for this, including http, s3, local, ipfs, etc...) It may make sense to split this issue into multiple ones as well, but wanted to jot this down here for now :) |
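To make the path-prefix idea a bit more concrete, here is a minimal Python sketch of how a tool might expand a prefix into a list of WACZ files. The function name `list_wacz_files` is hypothetical, and only local POSIX paths and `file://` URLs are handled; the other schemes mentioned above (http, s3, ipfs) would each need their own listing logic, which is not sketched here.

```python
import tempfile
from pathlib import Path
from typing import List
from urllib.parse import urlsplit


def list_wacz_files(prefix: str) -> List[str]:
    """Expand a path prefix into a sorted list of WACZ file paths.

    Assumption: only local POSIX paths and file:// URLs are supported in
    this sketch. Any other scheme raises NotImplementedError.
    """
    parts = urlsplit(prefix)
    if parts.scheme not in ("", "file"):
        raise NotImplementedError(
            f"listing is not sketched for scheme {parts.scheme!r}"
        )
    root = Path(parts.path if parts.scheme == "file" else prefix)
    # Non-recursive: only files directly under the prefix are returned.
    return sorted(str(p) for p in root.glob("*.wacz"))


# Small demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.wacz", "b.wacz", "notes.txt"):
        Path(tmp, name).touch()
    found = [Path(p).name for p in list_wacz_files(tmp)]

print(found)
```

Dispatching on the URL scheme keeps the manifest format independent of the storage backend, so a reader can add new backends without changing the file.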
Hello 👋 ! @ikreymer brought this draft spec to my attention, because it touches on some of the threads I am currently pulling, compiling large collections using a single WACZ file. I think this draft is excellent, @edsu! I have a few questions/suggestions, in no particular order. Properties of
|
Details about how to aggregate multiple WACZ files into a single WACZ need to be added to the specification. This hinges on resources in the datapackage.json using a url for a WACZ rather than a path. See the Resource Information section in the Data Package specification for details:
There should also be a Data Package profile so that clients can easily distinguish between collections and regular WACZ files. Perhaps WACZ-Aggregation?
The specification should document that WACZ users MAY want to use the data-package.json as a place to record additional metadata about crawls. See the browsertrix-cloud API for examples.
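A client-side check along these lines could look like the following Python sketch. It assumes the tentative profile name WACZ-Aggregation floated above, and as a fallback treats a package whose resources all use url instead of path as an aggregation. Neither rule is part of any published spec, and the function name `is_wacz_aggregation` is hypothetical.

```python
def is_wacz_aggregation(datapackage: dict) -> bool:
    """Guess whether a parsed datapackage.json describes an aggregation.

    Assumptions (not from any published spec):
    - the profile value "WACZ-Aggregation" proposed in this issue,
      compared case-insensitively;
    - as a fallback, every resource has a "url" and none has a "path".
    """
    profile = datapackage.get("profile", "")
    if isinstance(profile, str) and profile.lower() == "wacz-aggregation":
        return True
    resources = datapackage.get("resources", [])
    return bool(resources) and all(
        "url" in r and "path" not in r for r in resources
    )


aggregation = {
    "profile": "WACZ-Aggregation",
    "resources": [{"name": "crawl-1", "url": "https://example.com/crawl-1.wacz"}],
}
regular = {
    "profile": "data-package",
    "resources": [{"name": "pages.jsonl", "path": "pages/pages.jsonl"}],
}

print(is_wacz_aggregation(aggregation), is_wacz_aggregation(regular))
```

An explicit profile is the more robust signal; the url-versus-path fallback is only a heuristic for packages that omit it.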