PharmGKB Data Import #1056

Open
spiekos wants to merge 26 commits into base: master

spiekos commented Jul 17, 2024

This PR documents the PharmGKB data import: it introduces the data and describes the import process, new schema, scripts, and tMCF files.

spiekos and others added 19 commits July 12, 2024 09:50
Add README.md file that explains the import including introducing the datasets, description of the import, file artifacts, and import procedure
Add initial draft of README.md from Suhana
Add tmcf and scripts files that support import of primary and relationships data from PharmGKB
Add manual mapping file between disease PharmGKB IDs and MeSH Descriptor IDs for diseases that are not retrieved via name matching to existing MeSH Descriptor nodes using the datacommons API
Add the mapping files between PharmGKB IDs of chemicals and drugs and the dcids of the corresponding existing nodes in Data Commons. These files are generated as output by the `format_chemicals.py` and `format_drugs.py` scripts, respectively.
Update table of contents including adding new sections, update the import procedure, script files and links, and the Notes and Caveat subsection
Fill in Dataset Documentation and Relevant Links subsection
Fix superscripting
Add About the Dataset, Download Data, and Dataset Overview subsections
Fill in the Artifacts subsection
Fill in schema overview subsection
update paths to tmcf files and the new schema subsection
update dcid Generation subsection for phenotypes
update New Schema subsection
update information regarding tests
spiekos requested review from pradh, ajaits and hareesh-ms July 17, 2024 07:41
spiekos self-assigned this Jul 17, 2024
copybara-service bot pushed a commit to datacommonsorg/schema that referenced this pull request Jul 17, 2024
…ng and formatting of the PharmGKB CSV+tMCF pairs. It also breaks out the phenotypes to distinguish that one is a MeSHQualifier and seven are MeSHSupplementaryConceptRecords, unlike the rest of the phenotypes, which are MeSHDescriptors. The phenotypes were therefore separated into three CSV+tMCF pairs. In particular, the links to the enums and between entity types were fixed by initializing all referenced nodes and then pointing to them within the tMCF. Because of this, any existence-missing errors in the JSON reports can be ignored. The changes to the scripts, tMCF files, and documentation (README.md) for this import are part of GitHub PR 1056 datacommonsorg/data#1056

Schema Changes:
- Add CPICLevelEnum, DosageGuidelineSourceCpicNoRecommendation, DrugTypeEnum, PGxLevelEnum, PharmacogeneticAssociationEnum.
- Add properties for clinicalAnnotationCount, clinicalAnnotationCountLevel1_2, clinicalGuidelineAnnotationCount, dosageGuideline, drugHasPrescribingInfo, drugLabelAnnotationCount, drugType, fdaTopPharmacogeneticLevel, geneticVariantAnnotationCount, hasCpicDosingGuideline, hasGenomicCoordinates, hasGeneticVariantAnnotation, hasPrescribingInfo, medicalDictionaryForRegulatoryActivitiesId, metabolicPathwayCount, pharmageneticAssociation, topClinicalAnnotationLevel, topCpicLevel, topPharmacogeneticLevel, veryImportantPharmacogeneCount.
- Remove properties for fdaTopPGxLevel, mintID, nationalClinicalTrialNumber, nationalDrugCode, nationalDrugFileReferenceTerminologyCode, neuroMabID, patentID, pharmGkbClinicalAnnotationCount, pharmGkbPathwayCount, pkgbTags.

PiperOrigin-RevId: 653311391
copybara-service bot pushed a commit to datacommonsorg/schema that referenced this pull request Jul 18, 2024
PiperOrigin-RevId: 653311391
copybara-service bot pushed a commit to datacommonsorg/schema that referenced this pull request Jul 23, 2024
PiperOrigin-RevId: 653311391
copybara-service bot pushed a commit to datacommonsorg/schema that referenced this pull request Jul 23, 2024
PiperOrigin-RevId: 655002332
spiekos added 2 commits July 29, 2024 17:10
Add information about the illegal dcid character @ being replaced with _Cluster when generating Gene dcids.
spiekos requested review from dwnoble and clincoln8 and removed request for pradh October 16, 2024 05:28
Update gene_var.tmcf to assign Entity2 as Variant and Entity1 as Gene in the output CSV file.
Fix references for entity1 vs entity2 in output file so that they correctly map to the GeneticVariant and Gene entities
Fix hierarchical ontology for classes used in import
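
For readers unfamiliar with the dcid rule mentioned in the commits above (the illegal character @ is replaced with _Cluster when generating Gene dcids), a minimal sketch of that substitution might look like the following. The function name `gene_symbol_to_dcid`, the `bio/` prefix, and the example symbols are assumptions for illustration and are not taken from the PR's scripts.

```python
def gene_symbol_to_dcid(symbol: str, prefix: str = "bio/") -> str:
    """Build a Gene dcid from a gene symbol.

    Hypothetical helper: '@' is not allowed in a dcid, so it is replaced
    with '_Cluster', as described in the commit message. The 'bio/' prefix
    is an assumption, not confirmed by this thread.
    """
    return prefix + symbol.strip().replace("@", "_Cluster")


if __name__ == "__main__":
    # Example symbols chosen for illustration only.
    for symbol in ("CYP2D6", "HOXA@"):
        print(symbol, "->", gene_symbol_to_dcid(symbol))
```

Symbols without @ pass through unchanged apart from the prefix, so the helper can be applied to every gene row uniformly.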

dwnoble commented Nov 5, 2024

Hi @spiekos - is this PR still relevant?

spiekos commented Nov 6, 2024

> Hi @spiekos - is this PR still relevant?

Yes! The data has already been integrated into BMDC. This GitHub PR documents the dataset and the data cleaning, but it's never been reviewed or approved. Are you able to review this @dwnoble?

dwnoble commented Nov 7, 2024

Looks generally good to me. Adding @chejennifer as a reviewer for an additional sanity check.

dwnoble requested a review from chejennifer November 7, 2024 17:39

chejennifer left a comment

just some suggestions!

print("Error: One or both columns not found in the DataFrame.")
return df

# where missing values in one column with values of a second column

nit: where -> replace

list_formatted = []  # Initialize an empty list to store formatted values

for item in list_values:
    check_for_illegal_charc(item)  # Validate the item (function assumed to be defined elsewhere)

if there are illegal characters, besides just logging the message, should you also remove those values or exit the script or something?
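
One possible way to act on this, sketched under stated assumptions: have the check return what it found so the caller can either drop the value or stop, rather than only logging. The character set, `find_illegal_chars`, `format_values`, and the `strict` flag below are all hypothetical; the PR's actual `check_for_illegal_charc` and the characters it checks are not shown in this thread.

```python
import logging

# Assumption: an illustrative set of characters treated as illegal in dcids.
# The set the PR's scripts actually check is not shown in this thread.
ILLEGAL_CHARS = frozenset("'*<>@[]|:; ")


def find_illegal_chars(value):
    """Return the sorted illegal characters present in value (empty if clean)."""
    return sorted(set(value) & ILLEGAL_CHARS)


def format_values(values, strict=False):
    """Keep clean values; drop (or, if strict, reject) values with illegal characters."""
    formatted = []
    for item in values:
        bad = find_illegal_chars(item)
        if bad:
            if strict:
                # Stop the script so the bad value cannot reach the output CSV.
                raise ValueError(f"Illegal characters {bad} in {item!r}")
            logging.warning("Dropping %r: illegal characters %s", item, bad)
            continue
        formatted.append(item)
    return formatted
```

With `strict=False` the offending values are logged and skipped; with `strict=True` the first offending value raises `ValueError`, which suits a batch import where silent data loss would be worse than a failed run.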


Args:
s: The string to classify.
original_column: The name of the column to split.

I only see one argument in this function?

]


def get_unique_new_cols(df, col):

a lot of these functions seem to be the same in all the scripts. Could we have a util file or something so that we don't have to rewrite every function in every script?


3 participants