PharmGKB Data Import #1056

spiekos · 2024-07-17T07:41:51Z

This PR documents the PharmGKB data import: it introduces the data and covers the import process, new schema, scripts, and tMCF files.

Add README.md file that explains the import including introducing the datasets, description of the import, file artifacts, and import procedure

Add initial draft of README.md from Suhana

Add tmcf and scripts files that support import of primary and relationships data from PharmGKB

Add manual mapping file between disease PharmGKB IDs and MeSH Descriptor IDs for the diseases that are not matched by name to existing MeSH Descriptor nodes using the datacommons API

Add the mapping files between pharmgkb ids of chemicals and drugs to dcids representing existing corresponding nodes in data commons. These files are generated as output by the `format_chemicals.py` and the `format_drugs.py` scripts respectively.

Update table of contents including adding new sections, update the import procedure, script files and links, and the Notes and Caveat subsection

Fill in Dataset Documentation and Relevant Links subsection

Fix superscripting

Add About the Dataset, Download Data, and Dataset Overview subsections

Fill in the Artifacts subsection

Fill in schema overview subsection

update paths to tmcf files and the new schema subsection

update dcid Generation subsection for phenotypes

update New Schema subsection

update information regarding tests

…ng and formatting of the PharmGKB CSV+tMCF pairs. It also breaks out the phenotypes to distinguish that one is a MeSHQualifier and seven are MeSHSupplementaryConceptRecords, unlike the rest of the phenotypes, which are MeSHDescriptors. Therefore these were separated into 3 CSV+tMCF pairs. In particular, the links to the enums and between entity types were fixed. This was done by initializing all nodes referenced and then pointing to them within the tMCF. Because of this, any existence missing errors in the json reports can be ignored. The changes to the scripts, tMCF files, and documentation (README.md) for this import are part of GitHub PR 1056 datacommonsorg/data#1056

Schema Changes:

- Add CPICLevelEnum, DosageGuidelineSourceCpicNoRecommendation, DrugTypeEnum, PGxLevelEnum, PharmacogeneticAssociationEnum.
- Add properties for clinicalAnnotationCount, clinicalAnnotationCountLevel1_2, clinicalGuidelineAnnotationCount, dosageGuideline, drugHasPrescribingInfo, drugLabelAnnotationCount, drugType, fdaTopPharmacogeneticLevel, geneticVariantAnnotationCount, hasCpicDosingGuideline, hasGenomicCoordinates, hasGeneticVariantAnnotation, hasPrescribingInfo, medicalDictionaryForRegulatoryActivitiesId, metabolicPathwayCount, pharmageneticAssociation, topClinicalAnnotationLevel, topCpicLevel, topPharmacogeneticLevel, veryImportantPharmacogeneCount.
- Remove properties for fdaTopPGxLevel, mintID, nationalClinicalTrialNumber, nationalDrugCode, nationalDrugFileReferenceTerminologyCode, neuroMabID, patentID, pharmGkbClinicalAnnotationCount, pharmGkbPathwayCount, pkgbTags.

PiperOrigin-RevId: 653311391

…ng and formatting of the PharmGKB CSV+tMCF pairs. It also breaks out the phenotypes to distinguish that one is a MeSHQualifier and seven are MeSHSupplementaryConceptRecords, unlike the rest of the phenotypes, which are MeSHDescriptors. Therefore these were separated into 3 CSV+tMCF pairs. In particular, the links to the enums and between entity types were fixed. This was done by initializing all nodes referenced and then pointing to them within the tMCF. Because of this, any existence missing errors in the json reports can be ignored. The changes to the scripts, tMCF files, and documentation (README.md) for this import are part of GitHub PR 1056 datacommonsorg/data#1056

Schema Changes:

- Add CPICLevelEnum, DosageGuidelineSourceCpicNoRecommendation, DrugTypeEnum, PGxLevelEnum, PharmacogeneticAssociationEnum.
- Add properties for clinicalAnnotationCount, clinicalAnnotationCountLevel1_2, clinicalGuidelineAnnotationCount, dosageGuideline, drugHasPrescribingInfo, drugLabelAnnotationCount, drugType, fdaTopPharmacogeneticLevel, geneticVariantAnnotationCount, hasCpicDosingGuideline, hasGenomicCoordinates, hasGeneticVariantAnnotation, hasPrescribingInfo, medicalDictionaryForRegulatoryActivitiesId, metabolicPathwayCount, pharmageneticAssociation, topClinicalAnnotationLevel, topCpicLevel, topPharmacogeneticLevel, veryImportantPharmacogeneCount.
- Remove properties for fdaTopPGxLevel, mintID, nationalClinicalTrialNumber, nationalDrugCode, nationalDrugFileReferenceTerminologyCode, neuroMabID, patentID, pharmGkbClinicalAnnotationCount, pharmGkbPathwayCount, pkgbTags.

PiperOrigin-RevId: 655002332

Add information about dcid illegal character @ being replaced with _Cluster when generating Gene dcids.

Update gene_var.tmcf to assign Entity2 as Variant and Entity1 as Gene in the output csv file.

Fix references for entity1 vs entity2 in output file so that they correctly map to the GeneticVariant and Gene entities

Fix hierarchical ontology for classes used in import
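To illustrate the Gene dcid note above (the illegal character `@` being replaced with `_Cluster`), here is a minimal sketch. The function name `gene_symbol_to_dcid`, the default `bio/` prefix, and the example symbol are assumptions for illustration only; the import scripts in this PR may implement the replacement differently.

```python
def gene_symbol_to_dcid(symbol: str, prefix: str = "bio/") -> str:
    """Build a Gene dcid from a gene symbol.

    Hypothetical helper: the '@' character is not legal in a dcid,
    so it is replaced with '_Cluster', as described in the README update.
    The 'bio/' prefix is an assumed default, not taken from the PR.
    """
    cleaned = symbol.strip().replace("@", "_Cluster")
    return prefix + cleaned


if __name__ == "__main__":
    # Gene cluster symbols can end in '@' (example symbol chosen for illustration).
    print(gene_symbol_to_dcid("HOXA@"))
    print(gene_symbol_to_dcid("CYP2D6"))
```

Symbols without `@` pass through unchanged apart from the prefix, so the helper is safe to apply to every gene row.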

dwnoble · 2024-11-05T19:35:46Z

Hi @spiekos - is this PR still relevant?

spiekos · 2024-11-06T03:01:17Z

> Hi @spiekos - is this PR still relevant?

Yes! The data has already been integrated into BMDC. This GitHub PR documents the dataset and the data cleaning, but it's never been reviewed or approved. Are you able to review this @dwnoble?

dwnoble · 2024-11-07T17:39:11Z

Looks generally good to me- Adding @chejennifer as a reviewer for an additional sanity check

chejennifer

just some suggestions!

chejennifer · 2024-11-09T15:43:19Z

scripts/biomedical/PharmGKB/scripts/format_chemicals.py

+        print("Error: One or both columns not found in the DataFrame.")
+        return df
+
+    # where missing values in one column with values of a second column


nit: where -> replace

chejennifer · 2024-11-09T15:45:37Z

scripts/biomedical/PharmGKB/scripts/format_chemicals.py

+        list_formatted = [] # Initialize an empty list to store formatted values
+
+        for item in list_values: 
+            check_for_illegal_charc(item) # Validate the item (function assumed to be defined elsewhere)


if there are illegal characters, besides just logging the message, should you also remove those values or exit the script or something?
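One possible way to act on this, sketched under assumptions: the validator either drops the offending values (and logs them) or stops the script, depending on a flag. The names `validate_values`, `has_illegal_chars`, and `ILLEGAL_CHARS` are hypothetical, and the character set is a placeholder, since the body of `check_for_illegal_charc` is not shown in the diff.

```python
import logging
from typing import Iterable, List

# Placeholder set for illustration; the characters the real
# check_for_illegal_charc rejects are not visible in this diff.
ILLEGAL_CHARS = frozenset("'*<>@[]|;: ")


def has_illegal_chars(value: str) -> bool:
    """Return True if the value contains any character from ILLEGAL_CHARS."""
    return any(ch in ILLEGAL_CHARS for ch in value)


def validate_values(values: Iterable[str], on_error: str = "drop") -> List[str]:
    """Filter or reject values that contain illegal characters.

    on_error="drop":  log a warning and leave the value out of the result.
    on_error="raise": stop immediately with a ValueError.
    """
    if on_error not in ("drop", "raise"):
        raise ValueError(f"unknown on_error mode: {on_error!r}")
    kept = []
    for value in values:
        if has_illegal_chars(value):
            if on_error == "raise":
                raise ValueError(f"illegal character in value: {value!r}")
            logging.warning("dropping value with illegal character: %r", value)
            continue
        kept.append(value)
    return kept
```

Dropping keeps the import running while leaving a trace in the log; raising is the stricter choice when a malformed value should block the output file from being written at all.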

chejennifer · 2024-11-09T15:47:53Z

scripts/biomedical/PharmGKB/scripts/format_chemicals.py

+
+    Args:
+        s: The string to classify.
+        original_column: The name of the column to split.


I only see one argument in this function?

chejennifer · 2024-11-10T05:21:39Z

scripts/biomedical/PharmGKB/scripts/format_drugs.py

+]
+
+
+def get_unique_new_cols(df, col):


a lot of these functions seem to be the same in all the scripts, could we have a util file or something so that we don't have to rewrite every function in every script
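A rough sketch of what such a shared module could look like, assuming the scripts operate on pandas DataFrames. The module name `pharmgkb_utils.py` and the function name `fill_missing_from` are hypothetical; the error message mirrors the one visible in the `format_chemicals.py` diff, and the rest is an assumption about how the helper behaves.

```python
# pharmgkb_utils.py (hypothetical shared module)
import pandas as pd


def fill_missing_from(df: pd.DataFrame, target_col: str,
                      source_col: str) -> pd.DataFrame:
    """Replace missing values in target_col with values from source_col.

    Returns the DataFrame unchanged if either column is absent.
    """
    if target_col not in df.columns or source_col not in df.columns:
        print("Error: One or both columns not found in the DataFrame.")
        return df
    result = df.copy()  # avoid mutating the caller's DataFrame
    result[target_col] = result[target_col].fillna(result[source_col])
    return result


# Each format_*.py script could then import the helper instead of redefining it:
#   from pharmgkb_utils import fill_missing_from
```

With the helper in one place, a fix to its behavior (or to its docstring) applies to every script at once.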

spiekos and others added 19 commits July 12, 2024 09:50

Create README.md

32a277b

Add README.md file that explains the import including introducing the datasets, description of the import, file artifacts, and import procedure

Update README.md

336a192

Add initial draft of README.md from Suhana

Add tmcf and scripts files

f3f9044

Add tmcf and scripts files that support import of primary and relationships data from PharmGKB

Add manual mapping file

44f4cb1

Add manual mapping file between disease PharmGKB IDs and MeSH Descriptor IDs for the diseases that are not matched by name to existing MeSH Descriptor nodes using the datacommons API

Redirect path to disease_pharmgkbID_to_dcid.csv

ae0e339

Update README.md

d88a100

Update table of contents including adding new sections, update the import procedure, script files and links, and the Notes and Caveat subsection

Update README.md

1d51062

Fill in Dataset Documentation and Relevant Links subsection

Update README.md

efeac6e

Fix superscripting

Update README.md

e07ff07

Add About the Dataset, Download Data, and Dataset Overview subsections

Update README.md

980a2c4

Fill in the Artifacts subsection

Update README.md

79aa542

Fill in schema overview subsection

Update README.md

84005f0

update paths to tmcf files and the new schema subsection

Update README.md

a2f82c4

update dcid Generation subsection for phenotypes

Update README.md

48ed875

update New Schema subsection

Update README.md

2212915

update information regarding tests

update chemicals and drugs mapping files

7cd7d3b

update scripts

7b953bf

update tmcf files

26c60d4

spiekos requested review from pradh, ajaits and hareesh-ms July 17, 2024 07:41

spiekos self-assigned this Jul 17, 2024

spiekos added 2 commits July 29, 2024 17:10

add header

ad33974

Update README.md

4c96dfb

Add information about dcid illegal character @ being replaced with _Cluster when generating Gene dcids.

spiekos requested review from dwnoble and clincoln8 and removed request for pradh October 16, 2024 05:28

spiekos added 4 commits October 15, 2024 22:28

Merge branch 'master' into PharmGKB

aa05800

Update gene_var.tmcf

63e767a

Update gene_var.tmcf to assign Entity2 as Variant and Entity1 as Gene in the output csv file.

Update gene_var.tmcf

76ba056

Fix references for entity1 vs entity2 in output file so that they correctly map to the GeneticVariant and Gene entities

Update README.md

5fe4a06

Fix hierarchical ontology for classes used in import

Merge branch 'master' into PharmGKB

0764a1e

dwnoble requested a review from chejennifer November 7, 2024 17:39

dwnoble approved these changes Nov 7, 2024


chejennifer reviewed Nov 11, 2024

