Standardizing metadata output #58
-
I believe @newsroomdev said @pickoffwhite said some assets may belong to multiple cases; in that case, perhaps the case_num should be a list, and the first entry should be used as the definitive location for saving ... ? (If "first" we should explain ... how "first" is determined.)
We have no metadata schema for case information. At an initial pass, I lumped all that into the metadata details section, which is rather repetitive but not the end of the world. If we go that way, I would recommend using a "case_" prefix for those variables such that it's clear whether things like I don't know if those other "details" sections should be standardized in some way. We should be clear about whether
-
In some cases, especially with smaller departments, the site will mention that the case is being investigated by a different agency. Nauman highlighted the example of Fremont PD, see the cases for 2018. In these cases, should we include something like:
As one of the details? It would be connected to each asset. It's not directly connected to the scraping/archival, but might aid downstream file/case organization work.
-
EDITED 08/13: case_num >> case_id for consistency
Hello!
As part of our ongoing efforts to maintain consistency and reliability with our data, we need to standardize the metadata dictionary that scrape_meta generates. Adopting a set specification with tests allows downstream consumers (i.e. scraper orchestration servers and data analysts) to store and retrieve assets from disparate websites.
This is an archival project in many ways. The following proposal aims to codify existing patterns while adapting to various websites contributors encounter.
Background
Goal: Streamline contributing and add more informative pull request checks. By adopting specific metadata key-value pairs, separate codebases can read the information and handle scraper orchestration and downloading.
The first step is to streamline contribution by creating a Site class with a public scrape_meta method. Please see https://github.com/biglocalnews/clean-scraper/blob/dev/docs/decisions/00-deprecate-scrape.md for more information. This method produces a Python list of dictionaries. Each dictionary contains shared, required information and additional distinctive information.
Proposal
Here is an example of expected JSON output going forward.
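As a minimal sketch of what such output could look like, assuming only the fields described in this proposal: the snippet below builds a list of dictionaries in Python and serializes it to JSON. Every value (the URLs, the case ID, the file name, the key inside details) is a hypothetical placeholder for illustration, not data from a real site.

```python
import json

# Hypothetical placeholder values; only the key names come from the proposal.
metadata = [
    {
        "asset_url": "https://example.com/files/case-001/video.mp4",  # required
        "case_id": "case-001",  # required
        "name": "video.mp4",
        "parent_page": "example_pd/case-001.html",
        "title": "Body camera footage",
        "details": {"date": "2024-01-01"},  # free-form extras (assumed key)
    }
]

# Serialize the list the way a downstream consumer might receive it.
print(json.dumps(metadata, indent=2))
```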
Required:
asset_url to download the respective assets
case_id to properly store the asset
Additional information, like parent_page, is saved via a persistent storage server for bookkeeping. Please use details for further information to record.
Field Descriptions:
asset_url (str): The URL to the asset. This is the most important field, ensuring we know what to download.
case_id (str): A unique identifier for the case.
name (str): The asset's file name.
parent_page (str): The asset's parent page file path.
title (str): A title for the asset.
details (dict): An object containing additional details. Examples:
Why?
Standardizing helps in several ways:
Validation
If this proposal is adopted, pull request checks can test additional Site modules for the scrape_meta method and whether it returns a list of dictionaries that include asset_url and case_id. Additional options can be passed to these tests to avoid long-running GitHub Actions tasks while testing scrape_meta.
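A check along these lines could be sketched as follows. This is only an illustration under stated assumptions, not the project's actual test suite: the validate_metadata helper and the stub Site class are hypothetical names, and the stub returns hard-coded placeholder data instead of scraping anything.

```python
REQUIRED_KEYS = ("asset_url", "case_id")


def validate_metadata(metadata):
    """Return a list of problems found; an empty list means the output passes."""
    if not isinstance(metadata, list):
        return ["scrape_meta must return a list"]
    problems = []
    for index, entry in enumerate(metadata):
        if not isinstance(entry, dict):
            problems.append(f"entry {index} is not a dictionary")
            continue
        for key in REQUIRED_KEYS:
            value = entry.get(key)
            # Each required key must be present and hold a non-empty string.
            if not isinstance(value, str) or not value:
                problems.append(f"entry {index} is missing a non-empty string {key!r}")
    return problems


class Site:
    """Hypothetical stand-in for a scraper module; returns placeholder data."""

    def scrape_meta(self):
        return [
            {"asset_url": "https://example.com/a.pdf", "case_id": "case-001"},
            {"asset_url": "https://example.com/b.pdf"},  # case_id deliberately missing
        ]


problems = validate_metadata(Site().scrape_meta())
print(problems)
```

In a pull request check, asserting that this list is empty would fail with one readable message per offending entry.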
Currently, contributors can check types before opening pull requests via pre-commit hooks and/or VSCode plugins.
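For static type checking, the field descriptions in this proposal could be expressed as a TypedDict. This is a sketch, not an adopted schema: the class names RequiredMeta and AssetMeta are hypothetical, and treating every field other than asset_url and case_id as optional is an assumption drawn from the "Required" list.

```python
from typing import Any, TypedDict


class RequiredMeta(TypedDict):
    # Keys every entry must carry.
    asset_url: str
    case_id: str


class AssetMeta(RequiredMeta, total=False):
    # Keys assumed optional, with types taken from the field descriptions.
    name: str
    parent_page: str
    title: str
    details: dict[str, Any]


# A type checker would accept this entry because both required keys are present.
entry: AssetMeta = {"asset_url": "https://example.com/a.pdf", "case_id": "case-001"}
```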
VSCode plugins
Here are a few that I use. If adopted, we can merge in workspace settings with these plugins.
Feedback/Comments
Thank you for your help in scaling up clean-scraper!
Best regards,
Gerald Rich
@biglocalnews