-
Notifications
You must be signed in to change notification settings - Fork 91
Document Python Repo API #1185
Comments
Hmm, yeah the references aren't added directly by the repo manager and may have individual accession codes that can't be inferred from the fasta file. Ideas on how to mitigate this? For passing compliance we could manually modify the SQL table, but essentially we need to be able to update records that don't have a repo manager endpoint. We could make add-referenceset interactive where it asks for accessions for each reference it adds as it iterates through them. `Reference "chr1" found, would you like to set optional values for this reference?" We could make add-referenceset take a data structure that maps reference names to accession arrays. It's important that we have tests internal to the server for this case. |
We should obtain all information from this file has all the accessions, names, English description. This is the official source. I sent you a python package I wrote to parse it. |
While it's not blocking sql_repo any more (we worked around it), this metadata should be getting properly ingested and assigned ASAP. This is presumably for GRCh37. Is there a matching metadata file for 38? What's the pattern in general? Also, @diekhans could you paste a link to the python code you mentioned in this issue? I'm not seeing it in my email or slack. |
The solution was to use the Python API for the repo manager. The CLI doesn't need to offer this functionality, I think we can close this issue and offer via documentation the more complicated edge cases for setting fields of underlying reference objects. Opinions? @macieksmuga From
|
That's a fine solution @david4096 - Is it captured in server setup documentation already? If not, let's do so and close this out. |
This is a tricky one, as we don't add the references directly as @david4096 pointed out. However, setting sourceAccessions and other metadata for reference objects is something we need to enable, and asking the admin to use the internal Python API isn't very user friendly. We will need to have an 'update' interface at some point, which allows the admin edit the attributes of objects in place within the DB. If the admin wants to change the description of a dataset containing 10K ReadGroupSets, it's not reasonable to ask them to delete it, recreate it and then re-add all of the ReadGroupSets. We could make this interface fairly low-level, using JSON as the input, so something like ga4gh_repo update-reference [referenceSetName] [referenceName] "{'sourceAccessions':['AC1', 'AC2']}" Or, we could do ga4gh_repo update [referenceId] "{'sourceAccessions':['AC1', 'AC2']}" Thoughts? |
I am going to remove this issue from the SQL milestone as it is not blocking the release. I am worried that the idea of adding an "update" interface is going to add a lot of complexity to the code. I would like to just use the pattern of "remove and re-add" for the files that need to be updated, and use that procedure for as long as we can. |
I agree @kozbo, we should stick with remove-and-readd for now. It's only a matter of time before users will complain though, as removing and readding 10K readgroupsets just so you can change the name or ncbiTaxonId of a reference set is pretty user-unfriendly. |
I think that after the ORM PR is in place, we should reevaluate how the python repo API is used and remove redundant data representations (why does it load things into memory)? To close this issue, we should present the docstrings for the Python repo in addition to the current repo CLI documentation. I would like to make closing this dependent on #1528 before closing. That will change the order in which items will need to be added to a repository. |
Given the vastly improved method for importing reference sets into the SQL repo, the current system does not have a mechanism to populate the
accessions
field for individual references (it's not even clear to me what the source of thisaccessions
field should be - per discussion in ga4gh/ga4gh-schemas#518 (comment) )For now, several key compliance tests for references,
checkReferenceFoundByAccession
, andcheckReferenceFoundByMD5Checksum
, are failing as a result of this gap.The text was updated successfully, but these errors were encountered: