Come up with a plan for documenting Babel provenance/versioning #205

Open
gaurav opened this issue Nov 26, 2023 · 2 comments
Labels
documentation Improvements or additions to documentation

Comments

@gaurav
Collaborator

gaurav commented Nov 26, 2023

It may be possible to use Snakemake reporting to make this happen: https://snakemake.readthedocs.io/en/stable/snakefiles/reporting.html

Essentially, figure out how to embed either an explicit version number (e.g. "Downloaded UMLS release 2023AB") or an implicit version number (e.g. "Downloaded the latest version of XYZ as of Nov 26, 2023") into the reports.

This may also help with Babel documentation (#148).
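
For example, a download rule could write a small provenance file next to its output and flag it with Snakemake's report() function, so that running snakemake --report report.html embeds the version information. This is a minimal sketch assuming a hypothetical rule name, file layout, and placeholder URL; the real Babel rules will differ:

```
rule download_umls:
    output:
        archive="babel_downloads/UMLS/umls-2023AB.zip",
        # report() marks this file for inclusion in the Snakemake HTML report
        provenance=report("babel_downloads/UMLS/PROVENANCE.txt", category="Downloads"),
    params:
        release="2023AB",  # explicit version number
    shell:
        """
        # placeholder URL, not the real UMLS download location
        curl -L -o {output.archive} https://example.org/umls/{params.release}.zip
        echo "Downloaded UMLS release {params.release} on $(date -I)" > {output.provenance}
        """
```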

@gaurav
Collaborator Author

gaurav commented Aug 6, 2024

The only provenance information that Babel currently has is the various "intermediate" files (e.g. https://stars.renci.org/var/babel_outputs/2024mar24/intermediate/), each of which includes the mappings that we use to construct our cliques. It is generally true that no mapping information outside the intermediate files is used and that every intermediate file comes from a single source. However, a particular mapping is not guaranteed to appear in only a single file, and the intermediate files include identifiers that are left out of the final cliques because their prefixes are not allowed by the Biolink model.

This suggests both a quick-and-dirty and a longer-term way of assembling metadata:

  1. (Quick-and-dirty) We can manually come up with a JSON file that maps every intermediate file to its source.
  2. (Long term) We can modify every Snakemake target that produces an intermediate file so that it also produces a metadata file that sits beside the intermediate file. Then, at the end, we can collect all of these intermediate-metadata files into a single JSON file that maps each intermediate file to its source (see the sketch after this list). Each Snakemake target can pull information from its corresponding download file to include metadata such as version number, last modified time, and download time.
  3. We load all of this mapping information into some kind of database.
  4. (Quick-and-dirty) We can provide a provenance API; for a given ID, we can:
    1. Ask NodeNorm for the full clique.
    2. Query the database to find every mapping from the preferred ID to every secondary ID, either directly or via intermediary identifiers.
    3. Return the list of all mappings along with their provenance information.
  5. (Long term) We incorporate the mapping information with provenance information into the cliques, store them in NodeNorm, and return them as needed.
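
Here is a minimal sketch of (1) and (2), assuming hypothetical sidecar filenames (*.META.json), field names, and paths; the actual layout and format are still to be decided:

```python
import json
from pathlib import Path

# (1) Quick-and-dirty: a hand-maintained mapping from intermediate file to source.
MANUAL_SOURCES = {
    "intermediate/taxon/concords/NCBI_MESH": {
        "sources": ["NCBITaxon", "MESH"],
        "version": "latest as of 2024-03-24",  # implicit version
    },
}

# (2) Long term: collect the per-intermediate metadata files written by each
# Snakemake target into a single JSON file mapping intermediate files to sources.
def collect_intermediate_metadata(intermediate_dir: str, out_file: str) -> None:
    combined = {}
    for meta_path in Path(intermediate_dir).rglob("*.META.json"):
        # e.g. ".../concords/NCBI_MESH.META.json" describes ".../concords/NCBI_MESH"
        intermediate_file = str(meta_path)[: -len(".META.json")]
        combined[intermediate_file] = json.loads(meta_path.read_text())
    Path(out_file).write_text(json.dumps(combined, indent=2))
```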

@gaurav
Collaborator Author

gaurav commented Oct 11, 2024

The key to understanding Babel provenance at the moment is the concord files: these provide mappings from one (or more) identifier systems to other identifier systems, and it is generally true that every mapping in Babel originates in one or more of these files. So here is a simple data model I propose for gradually introducing provenance information into Babel:

  1. Every compendium file except one (umls.txt) is created by a call to babel_utils.write_compendium(). I will modify this method to accept a list of concord files, from which concord metadata filenames can be inferred. It will use the information from these concord metadata files to create a corresponding provenance/[compendium name].json file for every compendium it is called upon to create. This will initially be blank, but as more concord metadata files are filled in, it will begin to include increasingly complete provenance information.
    • We should also add a report method that reports on how complete each piece of provenance is, so we can work backwards, adding provenance source information for the most important downloads first.
  2. I will modify every method that creates a concord file to create a corresponding concord metadata file as well (e.g. for babel_outputs/intermediate/taxon/concords/NCBI_MESH, this will be babel_outputs/intermediate/taxon/concords/NCBI_MESH.META.json). Initially, this will simply note the source names (NCBI, MESH) and the prefixes contained within the mapping file, but where available it will also include metadata from the source files (see the sketch after this list).
  3. I will modify every download method that is directly used in a concord to produce a download metadata file (e.g. babel_downloads/NCBITaxon/META.json). Initially, this will be created during the download and will include some hardcoded information about the knowledge source, but hopefully we can also use it to record version information, the URL being downloaded, an MD5 hash of the downloaded file, statistics (number of entries, unique IDs, etc.), and other useful information.
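
As a concrete sketch of what these metadata files might contain (the helper names and fields below are assumptions, not a settled format):

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def write_download_metadata(download_path: str, url: str, version: str) -> None:
    """Write e.g. babel_downloads/NCBITaxon/META.json alongside a download."""
    path = Path(download_path)
    meta = {
        "source": path.parent.name,  # e.g. "NCBITaxon"
        "url": url,  # the URL being downloaded
        "version": version,  # explicit release, or "latest as of <date>"
        "downloaded_at": datetime.now(timezone.utc).isoformat(),
        "md5": hashlib.md5(path.read_bytes()).hexdigest(),
    }
    (path.parent / "META.json").write_text(json.dumps(meta, indent=2))

def write_concord_metadata(concord_path: str, sources: list[str], prefixes: list[str]) -> None:
    """Write e.g. .../concords/NCBI_MESH.META.json next to the concord file,
    pulling in download metadata for each source where it is available."""
    meta = {"sources": sources, "prefixes": prefixes, "downloads": {}}
    for source in sources:
        download_meta = Path("babel_downloads") / source / "META.json"
        if download_meta.exists():
            meta["downloads"][source] = json.loads(download_meta.read_text())
    Path(concord_path + ".META.json").write_text(json.dumps(meta, indent=2))
```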

The goal is for download metadata to flow into the concord metadata, and from there into the compendium metadata, so that the final compendium metadata becomes an increasingly accurate report on the data that went into it. This will also allow us to test the approach and develop provenance data formats on the simpler compendia before working our way up to the larger and more complex ones. It also lets us do this work piecemeal, which could be helpful if our work is interrupted by other priorities.
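
A minimal sketch of that rollup, assuming write_compendium() receives the concord files as proposed in (1); the provenance/ layout and the completeness report shown here are assumptions:

```python
import json
from pathlib import Path

def write_compendium_provenance(compendium_name: str, concord_files: list[str],
                                provenance_dir: str = "babel_outputs/provenance") -> None:
    """Roll concord metadata up into provenance/<compendium name>.json."""
    provenance = {"compendium": compendium_name, "concords": {}}
    for concord_file in concord_files:
        meta_path = Path(concord_file + ".META.json")
        # An empty entry if the sidecar hasn't been written yet; the compendium
        # provenance fills in as more concord metadata files are completed.
        provenance["concords"][concord_file] = (
            json.loads(meta_path.read_text()) if meta_path.exists() else {}
        )
    # Simple completeness report, so we can prioritize the most important downloads.
    filled = sum(1 for meta in provenance["concords"].values() if meta)
    provenance["completeness"] = f"{filled}/{len(concord_files)} concords have metadata"
    out = Path(provenance_dir) / (compendium_name + ".json")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(provenance, indent=2))
```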
