Clarify ownership and provenance for datasets listed in data.json #296
Comments
Interesting point--thanks for raising it. Do we think this is something we could issue guidance on for the Publisher field? or are you suggesting we consider a new field in the schema? |
@philipashlock This is the other example: http://hub.healthdata.gov/dataset/supplemental-nutrition-assistance-program-snap-data-system |
right, great question. the short answer is no, a standard nomenclature/vocabulary for state/local government does not exist. there isn't really an authority to maintain such a lexicon (e.g. dept of ed would have lists of school districts, HHS would have local health agencies etc, but not a comprehensive thing). i can't even think of an association which would have the current understanding (perhaps american planning association?). all this being said, i can see this as an outreach opportunity if someone wants to dive in. strategy might be (a) just use Census geography as the nomenclature (eg Census Places as local cities, Census counties as all counties etc). this method is not comprehensive (it would miss some local tax areas, councils of governments etc), but it is a good start; (b) assign the task to FGDC to form a committee and make such a thing. this would be great for a standards body to take on, and FGDC could be one to ask (there could be others, but FGDC is a federal entity and i assume OMB might be able to direct them to do something); (c) ask an association for some in-kind help. in the geo world, NSGIC (national states GIS council) could be one, the american planning association could be one, and there might be some others. the short of it is, there isn't one today and building it is likely not a small task, depending on exactly how deep we want to go. sorry. |
@haleyvandyck - to your question:
I'd suggest the former. I think the schema is robust enough at the moment, so the next stage would be to outline a proposal for guidance on this topic. |
I am adding notes from Thursday's Common-Core Metadata Schema Review (see #325) where the group agreed there was value in accommodating more information about hierarchy (as well as primary source guidance) in the Publisher field. The discussion noted that some Publisher information is less likely to change (e.g. Bureaus) and that due to potential Publisher changes flexibility should be favored. The discussion also noted that hierarchical information might be best collected with additional designated fields. The group outlined three possible resolutions:
|
One option for clarifying guidance would be to provide very specific guidance about what to put in the string, including a clear vision for how to provide multiple levels of publisher, separated by commas. I'll take a pass at this. |
If we were to align the |
Here's a suggestion for updated Usage Notes for the publisher field:
The example can be updated to be:
|
Ah - sorry for missing your comment, Phil. I'd assumed that it had to stay a string. What would my above example (Office of Duty Operations) look like in that case? |
@gbinal Having a longer string for
In fully expanded form, it can look like:
{
  "publisher": {
    "name": "Office of Duty Operations",
    "subOrganizationOf": {
      "name": "International Trade Administration",
      "subOrganizationOf": {
        "name": "U.S. Department of Commerce"
      }
    }
  }
}
However, you would probably never do that, as it has the same issues of the string version, in that there can be inconsistency in how the
{
  "publisher": "http://example.org/publishers/1.json"
}
Or include a name but use a URL for
{
  "publisher": {
    "name": "Office of Duty Operations",
    "subOrganizationOf": "http://example.org/publishers/2.json"
  }
}
|
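A consumer of this shape would have to accept both an inline organization object and a URL reference at any level. The following is a minimal Python sketch of that, not part of the proposal itself. The function name `resolve_publisher`, the `registry` dict (which stands in for fetching the referenced JSON over HTTP), and the `publishers/3.json` entry are all assumptions made for illustration.

```python
def resolve_publisher(node, registry):
    """Return publisher names from the most specific office up to the top level.

    `node` is either an inline organization object or a URL string.
    `registry` is a hypothetical mapping of URL -> organization object,
    used here instead of a real HTTP fetch.
    """
    names = []
    seen = set()  # guard against reference cycles between URLs
    while node is not None:
        if isinstance(node, str):
            if node in seen:
                raise ValueError(f"cyclic publisher reference: {node}")
            seen.add(node)
            node = registry[node]
        names.append(node["name"])
        node = node.get("subOrganizationOf")
    return names


# Hypothetical contents of the referenced documents.
registry = {
    "http://example.org/publishers/2.json": {
        "name": "International Trade Administration",
        "subOrganizationOf": "http://example.org/publishers/3.json",
    },
    "http://example.org/publishers/3.json": {
        "name": "U.S. Department of Commerce",
    },
}

publisher = {
    "name": "Office of Duty Operations",
    "subOrganizationOf": "http://example.org/publishers/2.json",
}

chain = resolve_publisher(publisher, registry)
```

The same function handles the fully inline form, since it only dereferences a level when that level is a string.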
@feomike The UK did your option (b) and produced the RDF that powers this visualization: http://data.gov.uk/organogram/cabinet-office For option (a), I'd look at OpenCivicData's Division Identifiers, which establish stable identifiers for geographies. Examples:
|
For a Linked Data route, I think there would be just two fields in our schema. Where possible, I would be inclined to refer to the Wikipedia page (there is not one for the Office of Duty Operations) for a number of reasons:
|
@lilybradley I think the question of linking the publisher to some common identifier (like a Wikipedia URL) is a separate question from how to model the organizational hierarchy that the publisher is part of. For linking, |
Thanks for the background and examples @jpmckinney, it's immensely helpful. For implementers of the Project Open Data schema, I think the fully expanded example you provided is more realistic than pointing to a meaningful URL, but we can also help maintain consistency by enforcing a little validation. I think we'd probably want to require the use of the Another example of identifiers for government organizations is publicbodies.org which might have some overlap with OCD IDs, but is more focused on organizational units rather than geographies. @project-open-data, @opencivicdata, and @unitedstates might also be interested in doing more based on the various internal id mappings like the one from OMB - #341 |
Here's an example of what this might look like: |
This is addressed in 3789c65 |
Changes that still need to be addressed are changes in structure. Should we add the usage notes additions here or not?
* Adds optional describedByType field at the dataset and distribution level (#291, #332)
* Changes contactPoint field to an object that contains the name (fn) and email address (hasEmail) (#358)
* Adds fn field as part of contactPoint replacing earlier use of contactPoint (#358)
* Changes publisher field to an object that allows multiple levels of organizations (#296)
* Changes accessURL field to represent indirect access and to exist only within distribution (#217, #335)
* Changes format field to a human readable description and to exist only within distribution (#272, #293)
* Adds optional description field for use within distribution (#248)
* Adds optional title field for use within distribution (#248)
* Changes accrualPeriodicity field to use ISO 8601 date syntax (#292)
* Changes distribution field to become required-if-applicable and to always contain the accessURL or downloadURL fields (#217)
* Changes license field to be a URL (#196)
This issue ended up focusing on providing more clarity for dataset provenance using the I think there might have been discussion in another issue about some cases where there could be grey areas, like a cross-agency partnership where multiple agencies worked together on something that produced a dataset. I don't have a specific example of that, but if someone has one, perhaps we can test whether there would be justification for publishing the metadata in multiple agencies' data.json rather than determining a single owner. That said, I think it's pretty clear that datasets published by other governments, like cities and states, should not be included in an agency's data.json. Likewise, if a city or state were to implement the data.json spec, it'd be best if they provided a version that was exclusively limited to their own datasets rather than including datasets aggregated from other levels of government. |
Thank you @philipashlock for pointing me to this thread. @gbinal has posted a link to example JSON with a publisher tag that has a nested subOrganization element. My suggestion: it is easier to parse an array that represents a hierarchical relationship than an infinitely nested hierarchical relationship. Therefore, change the schema to represent the hierarchy as an array, as originally suggested by @rebeccawilliams on Jul 21 and @gbinal on Aug 13. I believe a "model" class with the "publisher" element as an array can be parsed by any JSON parser with a single line of code, whereas a recursively nested subOrganization element will require "special handling". |
Using a full publisher class provides a lot more information at minimal cost. I think you might be overestimating how much is needed for the "special handling". Here's a comparison of what's needed to process the two approaches; it's not a huge difference: Nested objects:
Or one simple array:
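A minimal Python sketch of the two approaches, for comparison. This is an illustration only, not the commenter's original snippet: the function names are hypothetical, and the array form assumes a list of names ordered from the most specific office to the top-level department.

```python
def names_from_nested(publisher):
    """Walk subOrganizationOf links and collect each organization's name."""
    names = []
    while publisher:
        names.append(publisher["name"])
        publisher = publisher.get("subOrganizationOf")
    return names


def names_from_array(publishers):
    """With the array encoding the names are already a flat list."""
    return list(publishers)


nested = {
    "name": "Office of Duty Operations",
    "subOrganizationOf": {
        "name": "International Trade Administration",
        "subOrganizationOf": {"name": "U.S. Department of Commerce"},
    },
}

# Assumed array encoding of the same hierarchy.
flat = [
    "Office of Duty Operations",
    "International Trade Administration",
    "U.S. Department of Commerce",
]

nested_names = names_from_nested(nested)
array_names = names_from_array(flat)
```

Under these assumptions the nested form costs one short loop, and both functions yield the same list of names.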
|
There is also the minor detail of information loss in the array encoding approach... |
Thank you @philipashlock for the response and the code example showing how to extract the attribute values. Regarding loss of information: the hierarchical array does not have to be an array of strings; it can be an array of objects carrying any additional attributes for each organization. |
@rrmishra The current proposal is based on using FOAF and ORG, so we're not inventing our own new way to represent these terms and relationships. As @jpmckinney suggested, this could be done by referencing an external URI and JSON representation of each organization object but I don't think it's realistic to expect most federal agencies to do that in the near future and it's trivial to include the additional organization objects inline. If an agency did want to represent It seems like what you're suggesting is to provide an array of organization objects. I think this would deviate from the DCAT standard we're trying to adhere to and based on the example you provided earlier it also wouldn't convey any relationships between the organizations. How do you suggest representing this as one publisher with a relationship to other organizations rather than multiple publishers? Also, can you provide a real use case where it's non-trivially more burdensome to process nested organization objects rather than an array of organization objects? |
FWIW, I'm def. a fan of simplicity in the schema but am convinced of the greater benefit that comes by adhering to norms and standards, in this case, FOAF, ORG, and DCAT. In addition to the followup questions @philipashlock poses above, I'd also ask whether there's a compelling example of the array setup following data standards better. |
I think there are a few different issues that came out of this discussion:
I think we've addressed the first issue here, but as I commented before, #296 (comment), I don't think we've fully addressed the third point. This issue isn't really part of the schema, but rather the broader guidance so I've gone ahead and created a new issue for it - #390 We also haven't addressed the second point on unique identifiers for the publisher (other than what's already accomplished by |
Sounds great. Thanks a bunch for parsing these issues. I'll go ahead and close this issue in the meantime. |
As @JoshData and @lilybradley have mentioned, the data.json from HHS includes data aggregated from State governments as well (here's an example from ny.gov). Does Project Open Data already have a clear requirement that datasets should only be those produced by the agency? If not, should that requirement be better specified?
I'm sure there are a lot of gray areas here where a local government has produced some data in partnership with a federal agency, but if we can provide better guidance on those scenarios (eg the data.json should only include data hosted on a federal .gov) then that would be helpful.
If other sources can be included, we'll want a better method to identify the source of these datasets. If each federal dataset listed the programCode and bureauCode as required it would be easy to filter out those that don't have them, but we'd still want a consistent and detailed way to identify those other sources.
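The filtering idea can be sketched in a few lines of Python. This is a rough illustration under assumptions: the catalog is taken to be a plain list of dataset objects, the function name `split_by_agency_codes` is hypothetical, and the titles and code values below are placeholders rather than real entries.

```python
def split_by_agency_codes(datasets):
    """Partition entries by whether both bureauCode and programCode are present.

    Entries missing either field (or carrying an empty value) go to `other`.
    """
    with_codes, other = [], []
    for dataset in datasets:
        if dataset.get("bureauCode") and dataset.get("programCode"):
            with_codes.append(dataset)
        else:
            other.append(dataset)
    return with_codes, other


# Placeholder catalog entries, for illustration only.
catalog = [
    {
        "title": "Example federal dataset",
        "bureauCode": ["000:00"],
        "programCode": ["000:000"],
    },
    {"title": "Example state dataset"},
    {"title": "Example partial entry", "bureauCode": ["000:00"]},
]

with_codes, other = split_by_agency_codes(catalog)
```

This only separates the entries. It says nothing about where the entries in `other` came from, which is the gap the paragraph above points to.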