This repository has been archived by the owner on Jun 18, 2024. It is now read-only.

Add label/value option under format #293

Closed
justgrimes opened this issue Mar 13, 2014 · 17 comments

Comments

@justgrimes
Contributor

It would be nice to add label and value under format. This would be useful if certain data files are zipped. If csv files are zipped then technically the format is zip i.e. "application/zip". With a label/value option under format you could identify and display the actual format of the contents of the file.

For example:

"distribution": [ {
    "accessURL": "http://www.foo.gov/csv.zip",
    "format": [{
        "label": "csv",
        "value": "application/zip"
        }]
    },{
    "accessURL": "http://www.foo.gov/shapefiles.zip",
    "format": [{
        "label": "shapefiles",
        "value": "application/zip"
        }]
    }
]

@mhogeweg
Contributor

I propose continuing discussion around this as part of #291.

@philipashlock
Contributor

I think that label would be misleading actually. It seems like the ideal scenario would be to make it clear that it's a zipped csv or a zipped shapefile. CKAN actually supports this natively with an additional field, but I don't know what convention would make sense in the context of POD/DCAT other than possibly introducing a new field.

For reference, in CKAN it's like this:

mimetype: standard mimetype (e.g. for zipped csv would be application/zip)
mimetype-inner: mimetype of innermost object (so for example would be text/csv)
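
As a rough illustration of the distinction between those two CKAN fields, here is a minimal Python sketch that opens a zip archive and tallies the media types of its members. The `EXTENSION_MEDIA_TYPES` table and the `inner_media_types` function are made up for this example and are not part of CKAN or POD; a real catalog would need a much fuller extension mapping.

```python
import io
import os
import zipfile
from collections import Counter

# Illustrative mapping only (an assumption for this sketch, not a CKAN or POD list).
EXTENSION_MEDIA_TYPES = {
    ".csv": "text/csv",
    ".json": "application/json",
    ".xml": "application/xml",
    ".pdf": "application/pdf",
}


def inner_media_types(zip_bytes: bytes) -> Counter:
    """Count the media types of the files inside a zip archive, by extension."""
    counts = Counter()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue  # directory entries carry no data of their own
            extension = os.path.splitext(member.filename)[1].lower()
            media_type = EXTENSION_MEDIA_TYPES.get(extension, "application/octet-stream")
            counts[media_type] += 1
    return counts
```

An archive holding only CSV members would report `text/csv` as its single inner type while the outer mimetype stays `application/zip`. A mixed archive reports several inner types, which is the case a single mimetype-inner value cannot express.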

@mhogeweg While there are parallels to the discussion about APIs and the different layers of media types there, I think this one use case for downloadable zipped files is common enough that it might make sense to address here separately.

@konklone
Contributor

I had assumed that posting zip files as data was a bug, not something to be resolved via mimetype.

I couldn't find any discussion here on POD about posting links to ZIP files instead of the actual data. It seems to fall into the same category as linking to a landing page (HTML) from which the data is linked somewhere. In either case, you can no longer automatically download and index the data itself - you have to make some sort of intelligent (or crude) guess at what to do with the HTML or ZIP file you've been given.

@mhogeweg
Contributor

Uncompressed, the DOI data.json file is about 110MB; zipped, it is about 14% of that (15MB). Zipping files will be a common practice for quite some time, and it would be good to have a mechanism to describe the file and what's inside (which could be more than one file, of different types).

I suggested combining it with the other discussion because I see this as very similar to the distinction between a web service and the response you get back (a SOAP service returning a KMZ, which is zipped KML file(s), or any other example of a service that's out there). Something could return the results of a study as a pdf and one or more csv, kml, xls, shp files, all zipped up in a convenient package.

@waldoj

waldoj commented Mar 14, 2014

FWIW, the sitemap.xml spec allows for gzipping the file, which is simply indicated by renaming the file to sitemap.xml.gz. How the client knows to find it there without having to check two locations every time, I don't have the faintest idea.

@smrgeoinfo
Contributor

So one distribution could be something like an open data package (http://dataprotocols.org/data-packages/). A MIME type for this might be sufficient, but if it were possible to add some properties on the link to help the user know a priori what they're going to get inside, that would be a good thing.

@philipashlock philipashlock added this to the Next Version of Common Core Metadata Schema milestone Mar 14, 2014
@konklone
Contributor

If we're just talking about zipping a large file, HTTP has compression negotiation built in to the transport layer.

So you can get the benefits of compression without needing to involve that optimization step in the data catalog at all, and without having to list or guess at multiple URLs.

@justgrimes
Contributor Author

@konklone There are a couple of reasons why zip files might be used:

  • To collocate related data files where the content is split across multiple files (e.g. relational database extract; each table a csv)
  • To collocate data files where the data file format involves more than one file (e.g. "shapefiles" - shp, shx, prj, dbf)
  • Compression to reduce file size
  • Compression as a form of data integrity (ensure file isn't corrupted)
  • Existing policies and practices (that supersede data.gov)

@philipashlock - A quick scan of data.gov shows that a number of datasets don’t resolve to their expected format. It seems that some people assumed format to be the contents and not the linked data file. Moreover, the use of zip files is pretty pervasive. They are the most popular format on data.gov (32% of all listed formats). Displaying this format does little to inform the user; if anything it obfuscates holdings. It’s particularly bad for data sets that have multiple formats, use distribution, and more than one file resolve to zip files. At best the user would have to mouse over and hope the agency incorporated content file type in the actual naming of the file.

I'm just wondering what might be the most helpful to display to the user. Not sure what the optimal solution is: add label/format; add mimetype-inner; better educate agencies as to what the format field should contain & recommend agencies don't use zip files.
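
For anyone wanting to reproduce that kind of tally against a catalog, a minimal Python sketch follows. It assumes the catalog has already been parsed into a list of dataset dicts, each with a `distribution` list whose entries carry a plain string `format`; `format_shares` is a hypothetical helper name, not something from data.gov or POD.

```python
from collections import Counter


def format_shares(datasets):
    """Return each listed format's share of all distribution entries."""
    counts = Counter()
    for dataset in datasets:
        for distribution in dataset.get("distribution", []):
            fmt = distribution.get("format")
            if isinstance(fmt, str) and fmt.strip():
                # Normalise case and whitespace so "Application/ZIP " and
                # "application/zip" are counted together.
                counts[fmt.strip().lower()] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {fmt: count / total for fmt, count in counts.items()}
```

A share computed this way only says what the format field claims, so it cannot tell a zipped csv from a zipped shapefile, which is the gap being discussed here.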

@waldoj

waldoj commented Mar 19, 2014

@konklone, I think that servers will only send gzipped or deflated content if the client indicates that it can receive it. cURL, if I understand its defaults properly, requires the --compressed flag to send gzip,deflate as the Accept-Encoding value. This means that a human requesting data with a proper browser would get the automatically compressed version, but any cURL-based client would have to transfer the uncompressed data unless the implementor knew to specify --compressed.
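
The same opt-in applies to scripted clients in general. As a hedged sketch using only Python's standard library: `urllib` does not request or undo compression on its own, so the client sets `Accept-Encoding` itself and decodes according to the `Content-Encoding` the server reports. The function names here are invented for illustration, and the `deflate` branch assumes the zlib-wrapped form.

```python
import gzip
import urllib.request
import zlib
from typing import Optional


def build_request(url: str) -> urllib.request.Request:
    """Build a request that advertises support for compressed responses."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip, deflate"})


def decode_body(body: bytes, content_encoding: Optional[str]) -> bytes:
    """Undo the Content-Encoding the server reports; pass other bodies through."""
    encoding = (content_encoding or "identity").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        return zlib.decompress(body)  # assumes zlib-wrapped deflate data
    return body


def fetch(url: str) -> bytes:
    """Download a URL, transparently decoding a compressed response."""
    with urllib.request.urlopen(build_request(url)) as response:
        return decode_body(response.read(), response.headers.get("Content-Encoding"))
```

Whether the server actually compresses the response is still its choice; a client written this way simply handles both outcomes.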

@konklone
Contributor

The basic principle that I've assumed is at work here is that someone (be it data.gov, or anyone) should be able to use an agency's data catalog to download each dataset and know, programmatically, how to point the contained data at a program that knows how to handle that mime-type.

At the very least, I'd assume the mime-type of the dataset in the catalog would be that dataset's semantic mime-type (what's inside the zip file). Zip is just an implementation detail. In other words, instead of making a new field, mimetype-inner, for the contents, make a new field or fields that describe whatever implementation details the publisher has used that's in the way of accessing the real data.

@philipashlock philipashlock modified the milestone: Next Version of Common Core Metadata Schema Apr 14, 2014
@gbinal
Contributor

gbinal commented Jul 17, 2014

Some takeaways:

  • There should be guidance about whether format should reflect the inner or outer type (e.g. zip or the inner compressed file).
  • There seems to be buy-in for providing guidance on how advanced agencies could do this if they wanted to.
  • It's worth noting that the examples above don't necessarily end up creating a new field, but instead allow for a more complex implementation of the 'format' field.

@smrgeoinfo
Contributor

A more complex implementation of the 'format' field asks people to pay attention to profile syntax details that are likely to be misunderstood or ignored. Adding a field with clear semantics seems like it would cause less trouble.

@haleyvandyck
Contributor

So seems like it might make sense to actually add "mediatype" as a required-if-applicable field, and then move the current "format" field to optional. "format" can then be used as an additional opportunity to describe the format type in a human readable way, while using "mediatype" in a way that's more consistent with DCAT. Does that seem like a suitable solution?
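
To make that split concrete, here is a small Python sketch of how a catalog front end might choose what to show. The camelCase `mediaType` spelling, the `downloadURL` key and the "Zipped CSV" wording are assumptions made for illustration, and `display_format` is a hypothetical helper rather than anything defined by the schema.

```python
def display_format(distribution: dict) -> str:
    """Prefer the human-readable format; fall back to the machine media type."""
    label = (distribution.get("format") or "").strip()
    media_type = (distribution.get("mediaType") or "").strip()
    return label or media_type or "unknown"


# Hypothetical distribution entry for a zipped csv download.
example = {
    "downloadURL": "http://www.foo.gov/csv.zip",
    "mediaType": "application/zip",
    "format": "Zipped CSV",
}
```

With both fields present the user sees the descriptive label, while a harvester can still rely on the media type to decide how to open the file.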

@philipashlock philipashlock modified the milestone: Next Version of Common Core Metadata Schema (1.0 -> 1.1.) Jul 24, 2014
@gbinal
Contributor

gbinal commented Jul 30, 2014

Good point, @smrazgs. I agree, @haleyvandyck.

@gbinal
Contributor

gbinal commented Aug 22, 2014

@gbinal
Contributor

gbinal commented Sep 8, 2014

This is addressed in 6c376cf

rebeccawilliams pushed a commit that referenced this issue Oct 2, 2014
Changes that still need to be addressed are changes in structure and should we add usage notes additions here or no?:

* Adds optional describedByType field at the dataset and distribution level (#291, #332)
* Changes contactPoint field to an object that contains the name (fn) and email address (hasEmail) (#358)
* Adds fn field as part of contactPoint replacing earlier use of contactPoint (#358)
* Changes publisher field to an object that allows multiple levels of organizations (#296)
* Changes accessURL field to represent indirect access and to exist only within distribution (#217, #335) 
* Changes format field to a human readable description and to exist only within distribution (#272, #293)
* Adds optional description field for use within distribution (#248)
* Adds optional title field for use within distribution (#248)
* Changes accrualPeriodicity field to use ISO 8601 date syntax (#292)
* Changes distribution field to become required-if-applicable and to always contain the accessURL or downloadURL fields (#217)
* Changes license field to be a URL (#196)
@gbinal
Contributor

gbinal commented Nov 7, 2014

Thank you for driving the conversation around this issue and helping to assemble the v1.1 metadata update.

There appears to be strong consensus around this issue, which has been accepted in the v1.1 update and merged into Project Open Data. Project Open Data is a living project though. Please continue any conversations around how the schema can be improved with new issues and pull requests!

It's important for government staff as well as the public to continue to collaborate to make the Open Data Policy ever better. Though the v1.1 update is a substantial update, future iterations do not have to be, so whatever your ideas - big or small - please continue to work with this community to improve how government manages and opens its data.

@gbinal gbinal closed this as completed Nov 7, 2014