Come up with a plan for documenting Babel provenance/versioning #205

Open
gaurav opened this issue Nov 26, 2023 · 2 comments
Labels
documentation Improvements or additions to documentation

Comments

@gaurav
Collaborator

gaurav commented Nov 26, 2023

It may be possible to use Snakemake reporting to make this happen: https://snakemake.readthedocs.io/en/stable/snakefiles/reporting.html

Essentially, figure out how to embed either an explicit version number (e.g. "Downloaded UMLS release 2023AB") or an implicit version number (e.g. "Downloaded the latest version of XYZ as of Nov 26, 2023") into the reports.

This may also help with Babel documentation (#148).
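
For example, a download rule could write a small provenance file next to its output and flag it with Snakemake's report() function, so that running snakemake --report report.html embeds the version information. This is a minimal sketch assuming a hypothetical rule name, file layout, and placeholder URL; the real Babel rules will differ:

```
rule download_umls:
    output:
        archive="babel_downloads/UMLS/umls-2023AB.zip",
        # report() marks this file for inclusion in the Snakemake HTML report
        provenance=report("babel_downloads/UMLS/PROVENANCE.txt", category="Downloads"),
    params:
        release="2023AB",  # explicit version number
    shell:
        """
        # placeholder URL, not the real UMLS download location
        curl -L -o {output.archive} https://example.org/umls/{params.release}.zip
        echo "Downloaded UMLS release {params.release} on $(date -I)" > {output.provenance}
        """
```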

@gaurav
Collaborator Author

gaurav commented Aug 6, 2024

The only provenance information that Babel currently has is the various "intermediate" files (e.g. https://stars.renci.org/var/babel_outputs/2024mar24/intermediate/), each of which includes the mappings that we use to construct our cliques. It is generally true that no mapping information outside the intermediate files is used and that every intermediate file comes from a single source. However, a particular mapping is not guaranteed to appear in only a single file, and the intermediate files include identifiers that are left out of the final cliques because their prefixes are not allowed by the Biolink model.

This suggests both a quick-and-dirty and a longer-term way of assembling metadata:

  1. (Quick-and-dirty) We can manually come up with a JSON file that maps every intermediate file to its source.
  2. (Long term) We can modify every Snakemake target that produces an intermediate file so that it also produces a metadata file that sits beside the intermediate file. Then, at the end, we can collect all of these intermediate-metadata files into a single JSON file that maps each intermediate file to its source (see the sketch after this list). Each Snakemake target can pull information from its corresponding download file to include metadata such as version number, last modified time, and download time.
  3. We load all of this mapping information into some kind of database.
  4. (Quick-and-dirty) We can provide a provenance API; for a given ID, we can:
    1. Ask NodeNorm for the full clique.
    2. Query the database to find every mapping from the preferred ID to every secondary ID, either directly or via intermediary identifiers.
    3. Return the list of all mappings along with their provenance information.
  5. (Long term) We incorporate the mapping information with provenance information into the cliques, store them in NodeNorm, and return them as needed.
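
Here is a minimal sketch of (1) and (2), assuming hypothetical sidecar filenames (*.META.json), field names, and paths; the actual layout and format are still to be decided:

```python
import json
from pathlib import Path

# (1) Quick-and-dirty: a hand-maintained mapping from intermediate file to source.
MANUAL_SOURCES = {
    "intermediate/taxon/concords/NCBI_MESH": {
        "sources": ["NCBITaxon", "MESH"],
        "version": "latest as of 2024-03-24",  # implicit version
    },
}

# (2) Long term: collect the per-intermediate metadata files written by each
# Snakemake target into a single JSON file mapping intermediate files to sources.
def collect_intermediate_metadata(intermediate_dir: str, out_file: str) -> None:
    combined = {}
    for meta_path in Path(intermediate_dir).rglob("*.META.json"):
        # e.g. ".../concords/NCBI_MESH.META.json" describes ".../concords/NCBI_MESH"
        intermediate_file = str(meta_path)[: -len(".META.json")]
        combined[intermediate_file] = json.loads(meta_path.read_text())
    Path(out_file).write_text(json.dumps(combined, indent=2))
```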

@gaurav
Collaborator Author

gaurav commented Oct 11, 2024

The key to understanding Babel provenance at the moment is the concord files: these provide mappings from one (or more) identifier systems to other identifier systems, and it is generally true that every mapping in Babel originates in one or more of these files. So here is a simple data model I propose for gradually introducing provenance information into Babel:

  1. Every compendium file except one (umls.txt) is created by a call to babel_utils.write_compendium(). I will modify this method to accept a list of concord files, from which concord metadata filenames can be inferred. It will use the information from these concord metadata files to create a corresponding provenance/[compendium name].json file for every compendium it is called upon to create. This will initially be blank, but as more concord metadata files are filled in, it will begin to include increasingly complete provenance information.
    • We should also add a report method that reports on how complete each piece of provenance is, so we can work backwards, adding provenance source information for the most important downloads first.
  2. I will modify every method that creates a concord file to create a corresponding concord metadata file as well (e.g. for babel_outputs/intermediate/taxon/concords/NCBI_MESH, this will be babel_outputs/intermediate/taxon/concords/NCBI_MESH.META.json). Initially, this will simply note the source names (NCBI, MESH) and the prefixes contained within the mapping file, but where available it will also include metadata from the source files (see the sketch after this list).
  3. I will modify every download method that is directly used in a concord to produce a download metadata file (e.g. babel_downloads/NCBITaxon/META.json). Initially, this will be created during the download and will include some hardcoded information about the knowledge source, but hopefully we can also use it to record version information, the URL being downloaded, an MD5 hash of the downloaded file, statistics (number of entries, unique IDs, etc.), and other useful information.
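
As a concrete sketch of what these metadata files might contain (the helper names and fields below are assumptions, not a settled format):

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def write_download_metadata(download_path: str, url: str, version: str) -> None:
    """Write e.g. babel_downloads/NCBITaxon/META.json alongside a download."""
    path = Path(download_path)
    meta = {
        "source": path.parent.name,  # e.g. "NCBITaxon"
        "url": url,  # the URL being downloaded
        "version": version,  # explicit release, or "latest as of <date>"
        "downloaded_at": datetime.now(timezone.utc).isoformat(),
        "md5": hashlib.md5(path.read_bytes()).hexdigest(),
    }
    (path.parent / "META.json").write_text(json.dumps(meta, indent=2))

def write_concord_metadata(concord_path: str, sources: list[str], prefixes: list[str]) -> None:
    """Write e.g. .../concords/NCBI_MESH.META.json next to the concord file,
    pulling in download metadata for each source where it is available."""
    meta = {"sources": sources, "prefixes": prefixes, "downloads": {}}
    for source in sources:
        download_meta = Path("babel_downloads") / source / "META.json"
        if download_meta.exists():
            meta["downloads"][source] = json.loads(download_meta.read_text())
    Path(concord_path + ".META.json").write_text(json.dumps(meta, indent=2))
```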

The goal is for download metadata to flow into the concord metadata, and from there into the compendium metadata, so that the final compendium metadata becomes an increasingly accurate report on the data that went into it. This will also allow us to test the approach and develop provenance data formats on the simpler compendia before working our way up to the larger and more complex ones. It also lets us do this work piecemeal, which could be helpful if our work is interrupted by other priorities.
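
A minimal sketch of that rollup, assuming write_compendium() receives the concord files as proposed in (1); the provenance/ layout and the completeness report shown here are assumptions:

```python
import json
from pathlib import Path

def write_compendium_provenance(compendium_name: str, concord_files: list[str],
                                provenance_dir: str = "babel_outputs/provenance") -> None:
    """Roll concord metadata up into provenance/<compendium name>.json."""
    provenance = {"compendium": compendium_name, "concords": {}}
    for concord_file in concord_files:
        meta_path = Path(concord_file + ".META.json")
        # An empty entry if the sidecar hasn't been written yet; the compendium
        # provenance fills in as more concord metadata files are completed.
        provenance["concords"][concord_file] = (
            json.loads(meta_path.read_text()) if meta_path.exists() else {}
        )
    # Simple completeness report, so we can prioritize the most important downloads.
    filled = sum(1 for meta in provenance["concords"].values() if meta)
    provenance["completeness"] = f"{filled}/{len(concord_files)} concords have metadata"
    out = Path(provenance_dir) / (compendium_name + ".json")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(provenance, indent=2))
```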
