Add label/value option under format #293
Comments
I propose continuing discussion around this as part of #291.
I think that label would be misleading actually. It seems like the ideal scenario would be to make it clear that it's a zipped csv or a zipped shapefile. CKAN actually supports this natively with an additional field, but I don't know what convention would make sense in the context of POD/DCAT other than possibly introducing a new field. For reference, in CKAN it's like this:
@mhogeweg While there are parallels to the discussion about APIs and the different layers of media types there, I think this one use case for downloadable zipped files is common enough that it might make sense to address here separately. |
I had assumed that posting zip files as data was a bug, not something to be resolved via mimetype. I couldn't find any discussion of the matter here on POD about posting links to ZIP files instead of the actual data. It seems like it falls into the same category as including a link to a landing page (HTML) where data is linked to somewhere. Under either case, you can no longer automatically download and index the data itself - you have to make some sort of intelligent (or crude) guess at what to do with the HTML or ZIP file you've been given. |
The DOI data.json file is about 110MB uncompressed, while zipped up it is about 14% of that (15MB). Zipping files will be a common practice for quite some time, and it would be good to have a mechanism to describe the file and what's inside (which could be more than one file of different types). I suggested combining it with the other discussion as I see this as very similar to the distinction between a web service and the response you get back (a SOAP service returning a KMZ, which is zipped KML file(s), or any other example of a service that's out there). Something could return the results of a study as a PDF and one or more CSV, KML, XLS, or SHP files, all zipped up in a convenient package.
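To make "describe the file and what's inside" concrete, here is a minimal Python sketch of how a harvester might summarize the media types found inside a ZIP archive. It assumes the archive is already in memory and that file extensions are a good enough hint; the names `inner_media_types` and `make_sample_zip` are hypothetical and not part of POD, DCAT, or CKAN.

```python
import io
import mimetypes
import zipfile
from collections import Counter


def inner_media_types(zip_bytes: bytes) -> Counter:
    """Count the guessed media types of the files inside a ZIP archive."""
    counts = Counter()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no data of their own
            # Guess from the file extension only; this is a heuristic.
            guessed, _ = mimetypes.guess_type(info.filename)
            counts[guessed or "application/octet-stream"] += 1
    return counts


def make_sample_zip() -> bytes:
    """Build a small in-memory archive (made-up contents) for illustration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("results.csv", "id,value\n1,2\n")
        archive.writestr("report.pdf", b"%PDF-1.4 placeholder")
    return buffer.getvalue()


if __name__ == "__main__":
    for media_type, count in inner_media_types(make_sample_zip()).items():
        print(media_type, count)
```

A summary like this only works after the archive has been downloaded, which is the argument for describing the inner contents in the catalog itself.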
FWIW, the sitemap.xml spec allows for gzipping the file, which is simply indicated by renaming the file to |
So one distribution could be something like an open data package (http://dataprotocols.org/data-packages/). A MIME type for this might be sufficient, but if it were possible to add some properties on the link to help the user know a priori what they're going to get inside, it would be a good thing.
If we're just talking about zipping a large file, HTTP has compression negotiation built in to the transport layer. So you can get the benefits of compression without needing to involve that optimization step in the data catalog at all, and without having to list or guess at multiple URLs. |
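As a rough sketch of that negotiation, a client can advertise gzip support with the `Accept-Encoding` request header and undo the compression based on the `Content-Encoding` response header. The helper names (`build_request`, `decode_body`, `fetch`) are hypothetical, and any URL passed in is the caller's assumption; only gzip is handled here.

```python
import gzip
import urllib.request
from typing import Optional


def build_request(url: str) -> urllib.request.Request:
    """Build a request telling the server that gzip responses are acceptable."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(body: bytes, content_encoding: Optional[str]) -> bytes:
    """Undo gzip content encoding so the caller sees the original bytes."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(body)
    return body  # server sent the body uncompressed (or in another encoding)


def fetch(url: str) -> bytes:
    """Download a resource, decompressing it only if the server compressed it."""
    with urllib.request.urlopen(build_request(url)) as response:
        encoding = response.headers.get("Content-Encoding")
        return decode_body(response.read(), encoding)
```

With this approach the catalog lists a single URL for the uncompressed resource, and whether compression is used is decided per request between client and server.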
@konklone There are a couple of reasons why zip files might be used:
@philipashlock - A quick scan of data.gov shows that a number of datasets don't resolve to their expected format. It seems that some people assumed format to describe the contents and not the linked data file. Moreover, the use of zip files is pretty pervasive: they are the most popular format on data.gov (32% of all listed formats). Displaying this format does little to inform the user; if anything it obfuscates holdings. It's particularly bad for datasets that have multiple formats, use distribution, and have more than one file resolving to a zip file. At best the user would have to mouse over and hope the agency incorporated the content file type in the actual naming of the file. I'm just wondering what might be the most helpful to display to the user. I'm not sure what the optimal solution is: add label/format; add mimetype-inner; or better educate agencies as to what the format field should contain and recommend that agencies don't use zip files.
@konklone, I think that servers will only send gzipped or deflated content if the client indicates that it can receive those. cURL, if I understand its defaults properly, requires a |
The basic principle that I've assumed is at work here, is that someone (be it data.gov, or anyone) should be able to use an agency's data catalog to download each dataset and know, programmatically, how to point the contained data at a program that knows how to handle that mime-type. At the very least, I'd assume the |
Some takeaways:
more complex implementation of the 'format' field is asking people to pay attention to profile syntax details that are likely to be misunderstood or ignored. Seems like adding a field with some clear semantics would cause less trouble. |
So it seems like it might make sense to actually add "mediatype" as a required-if-applicable field, and then move the current "format" field to optional. "format" can then be used as an additional opportunity to describe the format type in a human-readable way, while using "mediatype" in a way that's more consistent with DCAT. Does that seem like a suitable solution?
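A minimal sketch of what such a distribution entry could look like, built in Python and printed as JSON. The camel-case key `mediaType` is an assumption borrowed from DCAT's `dcat:mediaType`; the URL, the "Zipped CSV" label, and the helper `make_distribution` are hypothetical.

```python
import json
from typing import Optional


def make_distribution(download_url: str, media_type: str,
                      format_label: Optional[str] = None) -> dict:
    """Build one distribution entry with a machine-readable media type
    and an optional human-readable format description."""
    entry = {"downloadURL": download_url, "mediaType": media_type}
    if format_label:
        entry["format"] = format_label  # free text meant for display
    return entry


if __name__ == "__main__":
    dist = make_distribution(
        "https://example.gov/data/parcels.zip",  # hypothetical URL
        "application/zip",
        "Zipped CSV",
    )
    print(json.dumps(dist, indent=2))
```

The media type stays strictly about the file the URL resolves to, while the optional format text is free to tell a person what is inside it.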
Good point, @smrazgs. I agree, @haleyvandyck. |
Here's an example of what this might look like: |
This is addressed in 6c376cf |
Changes that still need to be addressed are changes in structure. Should we add usage note additions here or not?
* Adds optional describedByType field at the dataset and distribution level (#291, #332)
* Changes contactPoint field to an object that contains the name (fn) and email address (hasEmail) (#358)
* Adds fn field as part of contactPoint replacing earlier use of contactPoint (#358)
* Changes publisher field to an object that allows multiple levels of organizations (#296)
* Changes accessURL field to represent indirect access and to exist only within distribution (#217, #335)
* Changes format field to a human readable description and to exist only within distribution (#272, #293)
* Adds optional description field for use within distribution (#248)
* Adds optional title field for use within distribution (#248)
* Changes accrualPeriodicity field to use ISO 8601 date syntax (#292)
* Changes distribution field to become required-if-applicable and to always contain the accessURL or downloadURL fields (#217)
* Changes license field to be a URL (#196)
Thank you for driving the conversation around this issue and helping to assemble the v1.1 metadata update. There appears to be strong consensus around this issue, which has been accepted in the v1.1 update and merged into Project Open Data. Project Open Data is a living project though. Please continue any conversations around how the schema can be improved with new issues and pull requests! It's important for government staff as well as the public to continue to collaborate to make the Open Data Policy ever better. Though the v1.1 update is a substantial update, future iterations do not have to be, so whatever your ideas - big or small - please continue to work with this community to improve how government manages and opens its data. |
It would be nice to add label and value under format. This would be useful if certain data files are zipped. If csv files are zipped then technically the format is zip, i.e. "application/zip". With a label/value option under format you could identify and display the actual format of the contents of the file. For example: