
Modify preconsume script to work on one cohort at a time #1107

Merged

Conversation

callachennault
Contributor

@callachennault callachennault commented Feb 1, 2024

This PR:

  • Modifies the preconsume_problematic_samples.sh script to work on one cohort at a time.
  • Modifies the fetch-dmp-data-for-import.sh script to call preconsume_problematic_samples.sh once for each cohort, immediately before the CVR fetch for that cohort.

This should prevent problematic samples from being queued in the time between the preconsume step and the CVR fetcher step.
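The reordering described above can be sketched as a per-cohort loop. This is an illustrative sketch only, not the actual fetch-dmp-data-for-import.sh code; the cohort names and function bodies are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the per-cohort flow described in this PR.
# Function bodies are stand-ins for the real preconsume and CVR fetch steps.

preconsume_problematic_samples() {
    local cohort=$1
    echo "preconsume: ${cohort}"
}

fetch_cvr_data() {
    local cohort=$1
    echo "cvr-fetch: ${cohort}"
}

for cohort in mskimpact mskimpact_heme mskaccess mskarcher ; do
    # Run the preconsume step immediately before the CVR fetch for the same
    # cohort, so no problematic sample can be queued between the two steps.
    preconsume_problematic_samples "$cohort"
    fetch_cvr_data "$cohort"
done
```

The point of the change is that preconsume and fetch are adjacent per cohort, rather than one global preconsume pass followed later by all fetches.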

@callachennault callachennault marked this pull request as ready for review February 1, 2024 21:58
@@ -124,14 +125,18 @@ function consume_hardcoded_samples() {
rm -f ${PROBLEMATIC_EVENT_CONSUME_IDS_FILEPATH} ${PROBLEMATIC_METADATA_CONSUME_IDS_FILEPATH}
touch ${PROBLEMATIC_EVENT_CONSUME_IDS_FILEPATH}
touch ${PROBLEMATIC_METADATA_CONSUME_IDS_FILEPATH}
echo "P-0025907-N01-IM6" >> "${PROBLEMATIC_METADATA_CONSUME_IDS_FILEPATH}"
Collaborator


I mean, I know this was here before, but any idea what this is for 😆 Is it not getting caught by our usual script?

Contributor Author

@callachennault callachennault Feb 2, 2024


Yeah, I looked into it. This is a sample we're receiving from the tumor server, but it has a normal identifier. Rob added it to this consume_hardcoded_samples function until we get more info/a fix from CVR. I reached out to Mihir about it again and he said to continue consuming it; he'll get back to me if there's any change.
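For context on why the identifier above looks wrong: DMP sample IDs carry a tumor/normal segment (e.g. T01 vs N01), so a sample arriving from the tumor server with an N-style ID stands out. A minimal sketch of such a check, assuming the P-NNNNNNN-T/Nnn-PANEL ID convention (this helper is hypothetical, not part of the pipeline):

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns success only for tumor-style DMP sample IDs.
# Assumes IDs of the form P-<7 digits>-T<2 digits>-<panel>.
is_tumor_sample_id() {
    [[ "$1" =~ ^P-[0-9]{7}-T[0-9]{2}- ]]
}

# The hardcoded sample from the diff above has a normal-style identifier,
# so it would fail this check even though it came from the tumor server.
if ! is_tumor_sample_id "P-0025907-N01-IM6" ; then
    echo "normal-style identifier received from tumor server"
fi
```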

Collaborator

@averyniceday averyniceday left a comment


Small suggestion for a change, but otherwise looks good!

Contributor

@sheridancbio sheridancbio left a comment


Nice work. My only question is whether the detect_samples_with_problematic_metadata.py script will run correctly on the impact / impact-heme / access cohorts. If we don't want to worry about whether it runs correctly or not, we could skip the scan for metadata problems in those datasets. But the metadata is at the top level of the JSON object returned for samples, so it is probably compatible with all cohorts.
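The compatibility argument above rests on every cohort's fetch output having the metadata block at the top level of each sample object. A minimal sketch of that kind of scan, using grep on synthetic data (the field name "gene_panel" and the file layout are assumptions; the real detect_samples_with_problematic_metadata.py parses the JSON properly rather than grepping it):

```shell
#!/usr/bin/env bash
# Build a tiny synthetic fetch output with one good and one bad sample.
FETCH_OUTPUT_FILEPATH=$(mktemp)
cat > "$FETCH_OUTPUT_FILEPATH" <<'EOF'
{"results": [
  {"metadata": {"dmp_sample_id": "P-0000001-T01-IM6", "gene_panel": "IMPACT468"}},
  {"metadata": {"dmp_sample_id": "P-0000002-T01-IM6", "gene_panel": "UNKNOWN"}}
]}
EOF

# Because the metadata sits at the top level of every sample object, the
# same scan should apply to any cohort's fetch output: find samples whose
# gene panel is UNKNOWN and extract their DMP sample IDs.
grep '"gene_panel": "UNKNOWN"' "$FETCH_OUTPUT_FILEPATH" \
    | grep -o 'P-[0-9]\{7\}-[TN][0-9]\{2\}-[A-Z0-9]*'
```

With the synthetic input above, only the second sample ID would be emitted for preconsumption.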

import-scripts/fetch-dmp-data-for-import.sh (outdated comment, resolved)
}

function detect_samples_with_problematic_metadata() {
$DETECT_SAMPLES_WITH_PROBLEMATIC_METADATA_SCRIPT_FILEPATH ${ARCHER_FETCH_OUTPUT_FILEPATH} ${ARCHER_CONSUME_IDS_FILEPATH}
Contributor


I guess before we were only scanning Archer for genepanel=UNKNOWN. This change seems to mean that we will scan all 4 cohorts for bad gene panel references (probably a good thing). Does this run smoothly on the other cohorts, though? (The JSON schema may differ.)

Contributor Author


Yes, with the way I rewrote it, all cohorts will be checked for both problematic events and problematic metadata. I figured it made sense since we already have the functionality. I didn't run into any errors testing on the 4 cohorts, but the queues that I pulled didn't have any issues in them, so I might need to test a bit more to make sure.

@callachennault callachennault merged commit 80ab3c8 into knowledgesystems:master Feb 6, 2024
2 checks passed
@callachennault callachennault deleted the preconsume-fix branch February 6, 2024 19:23
sheridancbio pushed a commit to sheridancbio/cmo-pipelines that referenced this pull request Feb 9, 2024
sheridancbio pushed a commit to mandawilson/cmo-pipelines that referenced this pull request Mar 27, 2024
author Manda Wilson <[email protected]> 1703199176 -0500
committer Robert Sheridan <[email protected]> 1711560265 -0400

upgrade to java 21

switch to genome-nexus-annotation-pipeline that uses new maf repo

updated to spring 6, spring batch 5, spring boot 3 to match cbioportal

fix typos

Updates to AZ-MSKIMPACT to integrate with CDM (knowledgesystems#1098)

Fix bug in checking for duplicate Mutation Records (knowledgesystems#1099)

* Check if mutationRecord is duplicated before annotating

* Populate mutationMap in loadMutationRecordsFromJson

* add addRecordToMap

* Remove comments, add local vars for debugging

* Remove duplicate MAF variants for AZ

* Fix remove-duplicate-maf-variants call

* revert whitespace change

updates for migrating darwin and crdb to java11 (knowledgesystems#1080)

pom changes for pulling moved dependencies
changes to java args to silence warnings

Co-authored-by: cbioportal import user <[email protected]>

Remove Annotated MAF before Import (knowledgesystems#958)

* remove annotated MAF to prevent duplicate

* Update subset_and_merge_crdb_pdx_studies.py

---------

Co-authored-by: Avery Wang <[email protected]>

Script to combine arbitrary files (knowledgesystems#1104)

* Script to combine arbitrary files

* Modify unit tests to work with script changes

* Remove unnecessary column specifier

* Fix syntax bug

Add sophia script (knowledgesystems#1105)

* Add sophia script

* rename transpose_cna file

* Add filter-clinical-arg-functions script

* Add az var to correct automation environment

* Add correct path to transpose_cna script

* Call seq_date function

* Add seq_date before filtering columns

* syntax fix

* Fix call to filter out clinical attribute columns

* Fix nonsigned out file path

* Automate folder name

* directory fixes

* remove quotes?

* change date formatting

* output filepath for duplicate variants script

* use az_msk_impact_data_home var

* move sophia_data_home to automation environment

* Add comments

* Change dir structures in sophia script to match new repo structure

* Add git operations

* Remove test file

* Fix dirs for sophia zip command

* remove quotes

* Zip files before cleanup

* move zip step before git push

Add script for merging Dremio/SMILE into cmo-access (knowledgesystems#1102)

- adds cfdna clinical and timeline data from dremio/SMILE
- converts patient identifiers using "dmp over cmo" identifier logic from dremio
- dremio patient id mapping table export code called to produce mapping table
- main script then calls update_cfdna_clinical_sample_patient_ids_via_dremio.sh
- merge.py used to combine clinical data from dremio with clinical data from cmo-access
- metadata headers added using new script: merge_clinical_metadata_headers_py3.py
- other import process flow (similar to other import scripts) followed
- error detection step added after debugging for sporadic data loss in results

Co-authored-by: Manda Wilson <[email protected]>

Modify preconsume script to work on one cohort at a time (knowledgesystems#1107)

Call correct function name

add options for logging in for different accounts

Preconsume archer-solid-cv4 and add fetch loop (knowledgesystems#1129)

* Handle archer-solid-cv4 samples
* Add loop
* move each cohort to its own dir and fix filename

switch to genome-nexus-annotation-pipeline that uses new maf repo

use updated genome-nexus-annotation-pipeline

update version of cmo-pipelines to 1.0.0

Convert BatchConfiguration to new Spring Batch format

drop unneeded dependency from redcap

removed gdd, updated crdb and ddp batch configs to spring batch 5

removed commons-lang

start of converting cvr to spring batch 5

fix cvr fetcher BatchConfiguration

fixed redcap pipeline spring batch 5 configuration

make spring-batch-integration match batch version

Co-authored-by: Manda Wilson <[email protected]>

drop darwin fetcher (and docs/scripts)
mandawilson pushed a commit to mandawilson/cmo-pipelines that referenced this pull request Mar 27, 2024