Release 2.9.3 (#244)
* Bumping version

* support for extracting dug elements from graph (#197)

* support for extracting dug elements from graph

* adding flag for enabling dug element extraction from graph

* adding new config for node_to dug element parsing

* adding more parameters to crawler to enable configuration of element extraction logic

* add tests

* add tests for crawler

Co-authored-by: Yaphetkg <[email protected]>

* Display es scores (#199)

* Include ES scores in variable results

* Round ES score to 6

* Update _version.py (#200)

* Dev version bump (#202)

* Release/2.8.0 (#198)

* Update _version.py

* Update _version.py

updating version for final push to master

* Update factory.py

Adding more comments

Co-authored-by: Carl Schreep <[email protected]>
Co-authored-by: Yaphetkg <[email protected]>

* Release/v2.9.0 (#201)

* Update _version.py

Co-authored-by: Carl Schreep <[email protected]>
Co-authored-by: Yaphetkg <[email protected]>
Co-authored-by: Ginnie Hench <[email protected]>

Co-authored-by: Carl Schreep <[email protected]>
Co-authored-by: Yaphetkg <[email protected]>
Co-authored-by: Ginnie Hench <[email protected]>

* Attribute mapping from node to dug element (#203)

* adding more config options for node extraction

* some refactoring

Co-authored-by: Carl Schreep <[email protected]>
Co-authored-by: Yaphetkg <[email protected]>
Co-authored-by: Ginnie Hench <[email protected]>

* Changed DbGaP to SPARC in the scicrunch parser (#204)

* Anvil (#207)

* Added updated anvil dataset catalog

* Added script for downloading all anvil data dicts

* Added current anvil data dictionaries to data folder to be used for indexing

* Anvil parser (#208)

* anvil parser

* bump number of files test

* Update dbgap_parser.py

* Update anvil_dbgap_parser.py

change to AnVIL

* Update test_parsers.py

update test

Co-authored-by: Carl Schreep <[email protected]>
Co-authored-by: Yaphetkg <[email protected]>
Co-authored-by: Ginnie Hench <[email protected]>

* Initial Kaniko build.

* Move version file definition.

* Quote env vars.

* Update env vars.

* Update env vars.

* Update env vars.

* env var changes.

* env var changes.

* env var changes.

* env var changes.

* Update DOCKER_IMAGE var.

* Update DOCKER_IMAGE var in kaniko cmd.

* Update kaniko destination line.

* Update kaniko destination line.

* More variable madness.

* Programmatically remove quotes from version tag.

* dug dump-concepts API created and tested (#229)

Co-authored-by: Nathan Braswell <[email protected]>

* Update _version.py (#234)

* Version changes + separate build and publish.

* Semantic versioning prep.

* Add develop and master versioning and tagging.

* Ncpi index fix (#232)

* Renamed anvil to ncpi

* Update ncpi datasets catalog

* Modified script to download NCPI datasets into platform subfolders

* Updated NCPI integration dataset

* Removed unused variable

* Removed ncpi top level folder to spread results among subfolders

* Change output dir to data instead of ncpi subdir

* Moved NCPI subdirs into main data folder for ingest as per Yaphet's request

Co-authored-by: Alex Waldrop <[email protected]>

Co-authored-by: Carl Schreep <[email protected]>
Co-authored-by: Yaphetkg <[email protected]>
Co-authored-by: Ginnie Hench <[email protected]>
Co-authored-by: Howard Lander <[email protected]>
Co-authored-by: Alex Waldrop <[email protected]>
Co-authored-by: Charles Bennett <[email protected]>
Co-authored-by: Nathaniel Braswell <[email protected]>
Co-authored-by: Nathan Braswell <[email protected]>
Co-authored-by: cnbennett3 <[email protected]>
Co-authored-by: Alex Waldrop <[email protected]>
11 people authored Jul 12, 2022
1 parent 88ef5c9 commit 0507cab
Showing 10 changed files with 499 additions and 42 deletions.
152 changes: 111 additions & 41 deletions Jenkinsfile
@@ -1,61 +1,131 @@
pipeline {
agent {
kubernetes {
cloud 'kubernetes'
yaml '''
apiVersion: v1
kind: Pod
spec:
containers:
- name: agent-docker
image: helxplatform/agent-docker:latest
command:
- cat
tty: true
volumeMounts:
- name: dockersock
mountPath: "/var/run/docker.sock"
volumes:
- name: dockersock
hostPath:
path: /var/run/docker.sock
'''
agent {
kubernetes {
label 'kaniko-build-agent'
yaml """
kind: Pod
metadata:
name: kaniko
spec:
containers:
- name: jnlp
workingDir: /home/jenkins/agent/
- name: kaniko
workingDir: /home/jenkins/agent/
image: gcr.io/kaniko-project/executor:debug
imagePullPolicy: Always
resources:
requests:
cpu: "512m"
memory: "1024Mi"
ephemeral-storage: "4Gi"
limits:
cpu: "1024m"
memory: "2048Mi"
ephemeral-storage: "8Gi"
command:
- /busybox/cat
tty: true
volumeMounts:
- name: jenkins-docker-cfg
mountPath: /kaniko/.docker
- name: crane
workingDir: /tmp/jenkins
image: gcr.io/go-containerregistry/crane:debug
imagePullPolicy: Always
command:
- /busybox/cat
tty: true
volumes:
- name: jenkins-docker-cfg
projected:
sources:
- secret:
name: rencibuild-imagepull-secret
items:
- key: .dockerconfigjson
path: config.json
"""
}
}
environment {
PATH = "/busybox:/kaniko:/ko-app/:$PATH"
DOCKERHUB_CREDS = credentials("${env.CONTAINERS_REGISTRY_CREDS_ID_STR}")
REGISTRY = "${env.REGISTRY}"
REG_OWNER="helxplatform"
REG_APP="dug"
COMMIT_HASH="${sh(script:"git rev-parse --short HEAD", returnStdout: true).trim()}"
VERSION_FILE="src/dug/_version.py"
VERSION="${sh(script:'awk \'{ print $3 }\' src/dug/_version.py | xargs', returnStdout: true).trim()}"
IMAGE_NAME="${REGISTRY}/${REG_OWNER}/${REG_APP}"
TAG1="$BRANCH_NAME"
TAG2="$COMMIT_HASH"
TAG3="$VERSION"
TAG4="latest"
}
stages {
stage('Install') {
stage('Build') {
steps {
container('agent-docker') {
sh '''
make install
'''
container(name: 'kaniko', shell: '/busybox/sh') {
sh '''#!/busybox/sh
echo "Build stage"
/kaniko/executor --dockerfile ./Dockerfile \
--context . \
--verbosity debug \
--no-push \
--destination $IMAGE_NAME:$TAG1 \
--destination $IMAGE_NAME:$TAG2 \
--destination $IMAGE_NAME:$TAG3 \
--destination $IMAGE_NAME:$TAG4 \
--tarPath image.tar
'''
}
}
post {
always {
archiveArtifacts artifacts: 'image.tar', onlyIfSuccessful: true
}
}
}
stage('Test') {
steps {
container('agent-docker') {
sh '''
make test
'''
}
sh '''
echo "Test stage"
'''
}
}
stage('Publish') {
when {
buildingTag()
}
environment {
DOCKERHUB_CREDS = credentials('rencibuild_dockerhub_machine_user')
}
steps {
container('agent-docker') {
container(name: 'crane', shell: '/busybox/sh') {
sh '''
echo $DOCKERHUB_CREDS_PSW | docker login -u $DOCKERHUB_CREDS_USR --password-stdin
make publish
echo "Publish stage"
echo "$DOCKERHUB_CREDS_PSW" | crane auth login -u $DOCKERHUB_CREDS_USR --password-stdin $REGISTRY
crane push image.tar $IMAGE_NAME:$TAG1
crane push image.tar $IMAGE_NAME:$TAG2
if [ $BRANCH_NAME == "develop" ]; then
crane push image.tar $IMAGE_NAME:$TAG3
elif [ $BRANCH_NAME == "master" ]; then
crane push image.tar $IMAGE_NAME:$TAG3
crane push image.tar $IMAGE_NAME:$TAG4
if [ $(git tag -l "$VERSION") ]; then
echo "ERROR: Tag with version $VERSION already exists! Exiting."
else
# Recover some things we've lost:
git config --global user.email "helx-dev@lists"
git config --global user.name "rencibuild rencibuild"
grep url .git/config
git checkout $BRANCH_NAME
# Set the tag
SHA=$(git log --oneline | head -1 | awk '{print $1}')
git tag $VERSION "$SHA"
git remote set-url origin https://[email protected]/helxplatform/dug.git
git push origin --tags
fi
fi
'''
}
}
}
}
}
}
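
For context on the environment block above: VERSION is extracted by running awk over src/dug/_version.py to take the third whitespace-separated field, with xargs stripping the surrounding quotes. A minimal sketch of the file shape that extraction assumes (a single-assignment version file; contents illustrative rather than copied from this commit):

# src/dug/_version.py -- assumed shape
# `awk '{ print $3 }'` prints "2.9.3" with quotes; the trailing `| xargs` strips them for use as an image tag
__version__ = "2.9.3"

The tag scheme then follows the Publish stage: crane pushes the branch-name (TAG1) and commit-hash (TAG2) tags on every publish, develop and master also push the version tag (TAG3), and master additionally pushes latest (TAG4) and creates a matching git tag when one does not already exist.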
124 changes: 124 additions & 0 deletions bin/get_ncpi_data_dicts.py
@@ -0,0 +1,124 @@
####### ANVIL Syncing Script

# This script is used to generate the input for indexing AnVIL datasets in Dug.
# It parses and downloads the dbGaP datasets currently hosted on the AnVIL platform (TSV downloaded from https://anvilproject.org/data)
# and outputs all datasets as a tarball in the data directory to be indexed.
# NOTE: ncpi-dataset-catalog-results.tsv should be updated manually to ensure you sync all current AnVIL datasets.

#######

import os
import shutil
from ftplib import FTP, error_perm
import csv

# Hard-coded relative paths for the anvil catalog input file and output bolus
# This obviously isn't very elegant but it'll do for now
input_file = "../data/ncpi-dataset-catalog-results.tsv"
output_dir = "../data/"


# Helper function
def download_dbgap_study(study_id, output_dir):
# Download a dbgap study to a specific directory

ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login()
study_variable = study_id.split('.')[0]
os.makedirs(f"{output_dir}/{study_id}")

# Step 1: First we try and get all the data_dict files
try:
ftp.cwd(f"/dbgap/studies/{study_variable}/{study_id}/pheno_variable_summaries")
except error_perm:
print(f"WARN: Unable to find data dicts for study: {study_id}")
# Delete the subdirectory so we don't mistake it for a complete download
shutil.rmtree(f"{output_dir}/{study_id}")
return False

ftp_filelist = ftp.nlst(".")
for ftp_filename in ftp_filelist:
if 'data_dict' in ftp_filename:
with open(f"{output_dir}/{study_id}/{ftp_filename}", "wb") as data_dict_file:
ftp.retrbinary(f"RETR {ftp_filename}", data_dict_file.write)

# Step 2: Check to see if there's a GapExchange file in the parent folder
# and if there is, get it.
ftp.cwd(f"/dbgap/studies/{study_variable}/{study_id}")
ftp_filelist = ftp.nlst(".")
for ftp_filename in ftp_filelist:
if 'GapExchange' in ftp_filename:
with open(f"{output_dir}/{study_id}/{ftp_filename}", "wb") as data_dict_file:
ftp.retrbinary(f"RETR {ftp_filename}", data_dict_file.write)
ftp.quit()
return True


def main():
# Delete any existing output dirs so you can ensure all datasets are fresh
#if os.path.isdir(output_dir):
# shutil.rmtree(output_dir)

# Make new output dir
os.makedirs(f"{output_dir}/", exist_ok=True)

# Parse input table and download all valid dbgap datasets to output
missing_data_dict_studies = {}
studies = {}

with open(input_file) as csv_file:
csv_reader = csv.DictReader(csv_file, delimiter="\t")
header = False
for row in csv_reader:
if not header:
# Check to make sure tsv contains column for Study Accession
if "Study Accession" not in row:
# Throw error if expected column is missing
raise IOError("Input file must contain 'Study Accession' column")
header = True
continue

# Get platform and make subdir if necessary
platform = row["Platform"].split(";")
platform = platform[0] if "BDC" not in platform else "BDC"

# Add any phs dbgap studies to queue of files to get
study_id = row["Study Accession"]
if study_id.startswith("phs") and study_id not in studies:
studies[study_id] = True
try:
# Try to download to output folder if the study hasn't already been downloaded
if not os.path.exists(f"{output_dir}/{platform}/{study_id}"):
print(f"Downloading: {study_id}")
if not download_dbgap_study(study_id, f"{output_dir}/{platform}"):
missing_data_dict_studies[study_id] = True

except Exception as e:
# If anything happens, delete the folder so we don't mistake it for success
shutil.rmtree(f"{output_dir}/{platform}/{study_id}")

# Count the number of subdirs currently in output_dir as the number downloaded
num_downloaded = len([path for path in os.walk(output_dir) if path[0] != output_dir])

# Get number of failed for missing data dicts
num_missing_data_dicts = len(list(missing_data_dict_studies.keys()))

# Total number of possible unique studies
num_possible = len(list(studies.keys()))

# Write out list of datasets with no data dicts
with open(f"{output_dir}/download_summary.txt", "w") as sum_file:
sum_file.write(f"Unique dbgap datasets in ncpi table: {num_possible}\n")
sum_file.write(f"Successfully Downloaded: {num_downloaded}\n")
sum_file.write(f"Total dbgap datasets missing data dicts: {num_missing_data_dicts}\n")
sum_file.write(f"Dbgap datasets missing data dicts:\n")
for item in missing_data_dict_studies:
sum_file.write(f"{item}\n")

print(f"Unique dbgap datasets in ncpi table: {num_possible}\n")
print(f"Successfully Downloaded: {num_downloaded}\n")
print(f"Total dbgap datasets missing data dicts: {num_missing_data_dicts}\n")


if __name__ == "__main__":
main()
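
A minimal way to exercise the helper above, assuming network access to ftp.ncbi.nlm.nih.gov and an output directory that does not already contain the study (the accession below is an illustrative example, not taken from the catalog file):

# Hypothetical smoke test for download_dbgap_study(); run from the bin/ directory so the relative paths resolve
ok = download_dbgap_study("phs000001.v3.p1", "../data/AnVIL")
print("data dictionaries downloaded" if ok else "no pheno_variable_summaries found for this study")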
Binary file added data/AnVIL.tar.gz (not shown)
Binary file added data/BDC.tar.gz (not shown)
Binary file added data/CRDC.tar.gz (not shown)
Binary file added data/KFDRC.tar.gz (not shown)
