Merge pull request #5 from wtsi-npg/devel
pull from devel to master to create the first release
Showing 8 changed files with 295 additions and 130 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,65 @@ | ||
# npg_id_generation | ||
|
||
An API used to generate product IDs, which are hashes of the JSON representation of an object. | ||
An API used to generate product IDs, which are hashes of the JSON representation | ||
of an object. | ||
|
||
For different sequencing platforms different sets of identifiers might be used to | ||
fully describe the origin of data. For reasons of efficiency and interoperability | ||
between different systems it is sometimes desirable to be able to use a single | ||
identifier, which will be unique not only within data for a single platform, | ||
but also between different platforms. | ||
|
||
At the Sanger Institute, run ID, lane number and numerical tag index are used | ||
as identifiers for the Illumina platform. Historically, the first algorithm for | ||
generating unique identifiers was implemented in Perl for the Illumina platform, | ||
see [documentation](https://github.com/wtsi-npg/npg_tracking/blob/master/lib/npg_tracking/glossary/composition.pm). | ||
|
||
Later a need to have a similar API for other sequencing platforms arose. This | ||
package implements a Python API. The attributes of objects are sequencing | ||
platform specific. The generator for the PacBio platform is implemented by the | ||
`PacBioEntity` class. | ||
|
||
Examples of generating IDs for PacBio data: | ||
|
||
``` | ||
from npg_id_generation.pac_bio import PacBioEntity | ||
# from a JSON string via a class method | ||
test_case = '{"run_name": "MARATHON","well_label": "D1"}' | ||
print(PacBioEntity.parse_raw(test_case, content_type="json").hash_product_id()) | ||
# by setting object's attributes | ||
print(PacBioEntity(run_name="MARATHON", well_label="D1").hash_product_id()) | ||
# sample-specific identifier | ||
# for multiple tags a sorted comma-separated list of tags can be used | ||
print(PacBioEntity(run_name="MARATHON", well_label="D1", tags="AAGTACGT").hash_product_id()) | ||
``` | ||
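The comment about multiple tags can be illustrated with a small helper. This is a minimal sketch, not part of `npg_id_generation`: the name `tags_attribute` is hypothetical, and it assumes plain lexicographic sorting with no whitespace after the commas, which the README does not spell out.

```python
def tags_attribute(tags):
    """Join tag sequences into a sorted, comma-separated string.

    Hypothetical helper; assumes lexicographic order and no spaces.
    """
    return ",".join(sorted(tags))


# The same set of tags gives the same string regardless of input order.
print(tags_attribute(["CGTA", "AAGTACGT"]))  # AAGTACGT,CGTA
print(tags_attribute(["AAGTACGT", "CGTA"]))  # AAGTACGT,CGTA
```

Sorting before joining means the `tags` value, and therefore any ID derived from it, does not depend on the order in which the tags were listed.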
|
||
All generators should conform to a few simple rules: | ||
|
||
1. Uniqueness of the ID should be guaranteed. | ||
2. The ID should be a 64-character string. | ||
3. It should be possible to generate an ID from a JSON string. | ||
4. The value of the ID should **not** depend on the order of attributes given | ||
to the constructor or the order of keys used in JSON. | ||
5. The value of the ID should **not** depend on the amount of whitespace in | ||
the input JSON. | ||
6. The value of the ID should **not** depend on whether the undefined values | ||
of attributes are explicitly set. | ||
|
||
The examples below clarify the rules. Objects `o1` to `o6` should generate the same ID. | ||
|
||
``` | ||
o1 = PacBioEntity(run_name="r1", well_label="l1") | ||
o2 = PacBioEntity(run_name="r1", well_label="l1", tags=None) | ||
o3 = PacBioEntity(well_label="l1", run_name="r1") | ||
o4 = PacBioEntity.parse_raw('{"run_name": "r1","well_label": "l1"}', content_type="json") | ||
o5 = PacBioEntity.parse_raw('{"well_label": "l1", "run_name": "r1"}', content_type="json") | ||
o6 = PacBioEntity.parse_raw('{"well_label": "l1","run_name": "r1", "tags": null}', content_type="json") | ||
``` | ||
|
||
The algorithm used for generation of identifiers can be replicated in Perl; | ||
on identical input data it gives identical results. However, we cannot guarantee | ||
that this parity will always be maintained in future. |
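One way to satisfy rules 2 to 6 of the README can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's implementation: the function names are hypothetical, and the use of SHA-256 over a compact, key-sorted JSON string is an assumption that happens to be consistent with the 64-character requirement. The digests it produces are not claimed to match those of `PacBioEntity.hash_product_id()`.

```python
import hashlib
import json


def sketch_product_id(**attributes):
    """Return a 64-character hex digest for the given attributes.

    Hypothetical helper, not part of npg_id_generation.
    """
    # Rule 6: drop attributes whose value is undefined (None).
    defined = {k: v for k, v in attributes.items() if v is not None}
    # Rules 4 and 5: sorted keys and compact separators make the serialised
    # form independent of key order and of whitespace in the input.
    canonical = json.dumps(defined, sort_keys=True, separators=(",", ":"))
    # Rule 2: a SHA-256 hex digest is always 64 characters long.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def sketch_product_id_from_json(json_string):
    """Rule 3: build the ID from a JSON string."""
    return sketch_product_id(**json.loads(json_string))


# The six inputs mirror o1 to o6 from the README examples.
ids = {
    sketch_product_id(run_name="r1", well_label="l1"),
    sketch_product_id(run_name="r1", well_label="l1", tags=None),
    sketch_product_id(well_label="l1", run_name="r1"),
    sketch_product_id_from_json('{"run_name": "r1","well_label": "l1"}'),
    sketch_product_id_from_json('{"well_label": "l1", "run_name": "r1"}'),
    sketch_product_id_from_json('{"well_label": "l1","run_name": "r1", "tags": null}'),
}
print(len(ids))  # 1
```

All six inputs reduce to the same canonical string, so the set holds a single ID. Note that with any hash-based scheme, uniqueness (rule 1) holds only up to hash collisions.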
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
#!/usr/bin/env python3 | ||
# -*- coding: utf-8 -*- | ||
# | ||
# Copyright © 2022 Genome Research Ltd. All rights reserved. | ||
# | ||
# This program is free software: you can redistribute it and/or modify | ||
# it under the terms of the GNU General Public License as published by | ||
# the Free Software Foundation, either version 3 of the License, or | ||
# (at your option) any later version. | ||
# | ||
# This program is distributed in the hope that it will be useful, | ||
# but WITHOUT ANY WARRANTY; without even the implied warranty of | ||
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | ||
# GNU General Public License for more details. | ||
# | ||
# You should have received a copy of the GNU General Public License | ||
# along with this program. If not, see <http://www.gnu.org/licenses/>. | ||
# | ||
# @author Michael Kubiak <[email protected]> | ||
|
||
import argparse | ||
from npg_id_generation.pac_bio import PacBioEntity | ||
|
||
parser = argparse.ArgumentParser( | ||
    description="A script to generate a product ID for a PacBio product from a given run and well" | ||
) | ||
|
||
parser.add_argument( | ||
    "run_name", type=str, help="The name of the run to which the product belongs" | ||
) | ||
|
||
parser.add_argument("well_label", type=str, help="The well label") | ||
|
||
args = parser.parse_args() | ||
|
||
|
||
def main(): | ||
    print( | ||
        f"{PacBioEntity(run_name=args.run_name, well_label=args.well_label).hash_product_id()}\n" | ||
    ) | ||
|
||
|
||
if __name__ == "__main__": | ||
    main() |
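The script above calls `parser.parse_args()` at module level, so the arguments are parsed as soon as the module is imported. A possible alternative, sketched here with the hypothetical names `build_parser` and `parse`, keeps the parser construction and the parsing inside functions so they can be exercised with an explicit argument list. The sketch covers only the argument handling; it does not import `PacBioEntity` or compute an ID.

```python
import argparse


def build_parser():
    """Build the parser with the same two positional arguments as the script."""
    parser = argparse.ArgumentParser(
        description="Generate a product ID for a PacBio product from a run and well"
    )
    parser.add_argument(
        "run_name", type=str, help="The name of the run to which the product belongs"
    )
    parser.add_argument("well_label", type=str, help="The well label")
    return parser


def parse(argv=None):
    """Parse argv; argparse falls back to sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)


# Explicit arguments make the parsing step easy to check in isolation.
args = parse(["MARATHON", "D1"])
print(args.run_name, args.well_label)  # MARATHON D1
```

In a script structured this way, `main()` would call `parse()` and pass `args.run_name` and `args.well_label` to the entity class, leaving import of the module free of side effects.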
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +0,0 @@ | ||
from .main import PacBioEntity | ||