Release 2.9.3 (#244)
* Bumping version

* support for extracting dug elements from graph (#197)

* support for extracting dug elements from graph

* adding flag for enabling dug element extraction from graph

* adding new config for node_to dug element parsing

* adding more parameters to crawler to enable configuration of element extraction logic

* add tests

* add tests for crawler

Co-authored-by: Yaphetkg <[email protected]>

* Display es scores (#199)

* Include ES scores in variable results

* Round ES score to 6

* Update _version.py (#200)

* Dev version bump (#202)

* Release/2.8.0 (#198)

* Update _version.py

* Update _version.py

updating version for final push to master

* Update factory.py

Adding more comments

Co-authored-by: Carl Schreep <[email protected]>
Co-authored-by: Yaphetkg <[email protected]>

* Release/v2.9.0 (#201)

* Update _version.py

Co-authored-by: Carl Schreep <[email protected]>
Co-authored-by: Yaphetkg <[email protected]>
Co-authored-by: Ginnie Hench <[email protected]>

Co-authored-by: Carl Schreep <[email protected]>
Co-authored-by: Yaphetkg <[email protected]>
Co-authored-by: Ginnie Hench <[email protected]>

* Attribute mapping from node to dug element (#203)

* adding more config options for node extraction

* some refactoring

Co-authored-by: Carl Schreep <[email protected]>
Co-authored-by: Yaphetkg <[email protected]>
Co-authored-by: Ginnie Hench <[email protected]>

* Changed DbGaP to SPARC in the scicrunch parser (#204)

* Anvil (#207)

* Added updated anvil dataset catalog

* Added script for downloading all anvil data dicts

* Added current anvil data dictionaries to data folder to be used for indexing

* Anvil parser (#208)

* anvil parser

* bump number of files test

* Update dbgap_parser.py

* Update anvil_dbgap_parser.py

change to AnVIL

* Update test_parsers.py

update test

Co-authored-by: Carl Schreep <[email protected]>
Co-authored-by: Yaphetkg <[email protected]>
Co-authored-by: Ginnie Hench <[email protected]>

* Initial Kaniko build.

* Move version file definition.

* Quote env vars.

* Update env vars.

* Update env vars.

* Update env vars.

* env var changes.

* env var changes.

* env var changes.

* env var changes.

* Update DOCKER_IMAGE var.

* Update DOCKER_IMAGE var in kaniko cmd.

* Update kaniko destination line.

* Update kaniko destination line.

* More variable madness.

* Programmatically remove quotes from version tag.

* dug dump-concepts API created and tested (#229)

Co-authored-by: Nathan Braswell <[email protected]>

* Update _version.py (#234)

* Version changes + separate build and publish.

* Semantic versioning prep.

* Add develop and master versioning and tagging.

* Ncpi index fix (#232)

* Renamed anvil to ncpi

* Update ncpi datasets catalog

* Modified script to download NCPI datasets into platform subfolders

* Updated NCPI integration dataset

* Removed unused variable

* Removed ncpi top level folder to spread results among subfolders

* Change output dir to data instead of ncpi subdir

* Moved NCPI subdirs into main data folder for ingest as per Yaphet's request

Co-authored-by: Alex Waldrop <[email protected]>

Co-authored-by: Carl Schreep <[email protected]>
Co-authored-by: Yaphetkg <[email protected]>
Co-authored-by: Ginnie Hench <[email protected]>
Co-authored-by: Howard Lander <[email protected]>
Co-authored-by: Alex Waldrop <[email protected]>
Co-authored-by: Charles Bennett <[email protected]>
Co-authored-by: Nathaniel Braswell <[email protected]>
Co-authored-by: Nathan Braswell <[email protected]>
Co-authored-by: cnbennett3 <[email protected]>
Co-authored-by: Alex Waldrop <[email protected]>
11 people authored Jul 12, 2022
1 parent 88ef5c9 commit 0507cab
Showing 10 changed files with 499 additions and 42 deletions.
152 changes: 111 additions & 41 deletions Jenkinsfile
@@ -1,61 +1,131 @@
pipeline {
agent {
kubernetes {
cloud 'kubernetes'
yaml '''
apiVersion: v1
kind: Pod
spec:
containers:
- name: agent-docker
image: helxplatform/agent-docker:latest
command:
- cat
tty: true
volumeMounts:
- name: dockersock
mountPath: "/var/run/docker.sock"
volumes:
- name: dockersock
hostPath:
path: /var/run/docker.sock
'''
agent {
kubernetes {
label 'kaniko-build-agent'
yaml """
kind: Pod
metadata:
name: kaniko
spec:
containers:
- name: jnlp
workingDir: /home/jenkins/agent/
- name: kaniko
workingDir: /home/jenkins/agent/
image: gcr.io/kaniko-project/executor:debug
imagePullPolicy: Always
resources:
requests:
cpu: "512m"
memory: "1024Mi"
ephemeral-storage: "4Gi"
limits:
cpu: "1024m"
memory: "2048Mi"
ephemeral-storage: "8Gi"
command:
- /busybox/cat
tty: true
volumeMounts:
- name: jenkins-docker-cfg
mountPath: /kaniko/.docker
- name: crane
workingDir: /tmp/jenkins
image: gcr.io/go-containerregistry/crane:debug
imagePullPolicy: Always
command:
- /busybox/cat
tty: true
volumes:
- name: jenkins-docker-cfg
projected:
sources:
- secret:
name: rencibuild-imagepull-secret
items:
- key: .dockerconfigjson
path: config.json
"""
}
}
environment {
PATH = "/busybox:/kaniko:/ko-app/:$PATH"
DOCKERHUB_CREDS = credentials("${env.CONTAINERS_REGISTRY_CREDS_ID_STR}")
REGISTRY = "${env.REGISTRY}"
REG_OWNER="helxplatform"
REG_APP="dug"
COMMIT_HASH="${sh(script:"git rev-parse --short HEAD", returnStdout: true).trim()}"
VERSION_FILE="src/dug/_version.py"
VERSION="${sh(script:'awk \'{ print $3 }\' src/dug/_version.py | xargs', returnStdout: true).trim()}"
IMAGE_NAME="${REGISTRY}/${REG_OWNER}/${REG_APP}"
TAG1="$BRANCH_NAME"
TAG2="$COMMIT_HASH"
TAG3="$VERSION"
TAG4="latest"
}
stages {
stage('Install') {
stage('Build') {
steps {
container('agent-docker') {
sh '''
make install
'''
container(name: 'kaniko', shell: '/busybox/sh') {
sh '''#!/busybox/sh
echo "Build stage"
/kaniko/executor --dockerfile ./Dockerfile \
--context . \
--verbosity debug \
--no-push \
--destination $IMAGE_NAME:$TAG1 \
--destination $IMAGE_NAME:$TAG2 \
--destination $IMAGE_NAME:$TAG3 \
--destination $IMAGE_NAME:$TAG4 \
--tarPath image.tar
'''
}
}
post {
always {
archiveArtifacts artifacts: 'image.tar', onlyIfSuccessful: true
}
}
}
stage('Test') {
steps {
container('agent-docker') {
sh '''
make test
'''
}
sh '''
echo "Test stage"
'''
}
}
stage('Publish') {
when {
buildingTag()
}
environment {
DOCKERHUB_CREDS = credentials('rencibuild_dockerhub_machine_user')
}
steps {
container('agent-docker') {
container(name: 'crane', shell: '/busybox/sh') {
sh '''
echo $DOCKERHUB_CREDS_PSW | docker login -u $DOCKERHUB_CREDS_USR --password-stdin
make publish
echo "Publish stage"
echo "$DOCKERHUB_CREDS_PSW" | crane auth login -u $DOCKERHUB_CREDS_USR --password-stdin $REGISTRY
crane push image.tar $IMAGE_NAME:$TAG1
crane push image.tar $IMAGE_NAME:$TAG2
if [ $BRANCH_NAME == "develop" ]; then
crane push image.tar $IMAGE_NAME:$TAG3
elif [ $BRANCH_NAME == "master" ]; then
crane push image.tar $IMAGE_NAME:$TAG3
crane push image.tar $IMAGE_NAME:$TAG4
if [ $(git tag -l "$VERSION") ]; then
echo "ERROR: Tag with version $VERSION already exists! Exiting."
else
# Recover some things we've lost:
git config --global user.email "helx-dev@lists"
git config --global user.name "rencibuild rencibuild"
grep url .git/config
git checkout $BRANCH_NAME
# Set the tag
SHA=$(git log --oneline | head -1 | awk '{print $1}')
git tag $VERSION "$SHA"
git remote set-url origin https://[email protected]/helxplatform/dug.git
git push origin --tags
fi
fi
'''
}
}
}
}
}
}
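
For context on the environment block above: VERSION is extracted by running awk over src/dug/_version.py to take the third whitespace-separated field, with xargs stripping the surrounding quotes. A minimal sketch of the file shape that extraction assumes (a single-assignment version file; contents illustrative rather than copied from this commit):

# src/dug/_version.py -- assumed shape
# `awk '{ print $3 }'` prints "2.9.3" with quotes; the trailing `| xargs` strips them for use as an image tag
__version__ = "2.9.3"

The tag scheme then follows the Publish stage: crane pushes the branch-name (TAG1) and commit-hash (TAG2) tags on every publish, develop and master also push the version tag (TAG3), and master additionally pushes latest (TAG4) and creates a matching git tag when one does not already exist.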
124 changes: 124 additions & 0 deletions bin/get_ncpi_data_dicts.py
@@ -0,0 +1,124 @@
####### ANVIL Syncing Script

# This script is used to generate the input for indexing AnVIL datasets in Dug.
# It parses and downloads the dbGaP datasets currently hosted on the AnVIL platform (TSV downloaded from https://anvilproject.org/data)
# and outputs all datasets as a tarball in the data directory to be indexed.
# NOTE: ncpi-dataset-catalog-results.tsv should be updated manually to ensure you sync all current AnVIL datasets.

#######

import os
import shutil
from ftplib import FTP, error_perm
import csv

# Hard-coded relative paths for the anvil catalog input file and output bolus
# This obviously isn't very elegant but it'll do for now
input_file = "../data/ncpi-dataset-catalog-results.tsv"
output_dir = "../data/"


# Helper function
def download_dbgap_study(study_id, output_dir):
# Download a dbgap study to a specific directory

ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login()
study_variable = study_id.split('.')[0]
os.makedirs(f"{output_dir}/{study_id}")

# Step 1: First we try and get all the data_dict files
try:
ftp.cwd(f"/dbgap/studies/{study_variable}/{study_id}/pheno_variable_summaries")
except error_perm:
print(f"WARN: Unable to find data dicts for study: {study_id}")
# Delete the subdirectory so we don't mistake it for a complete download
shutil.rmtree(f"{output_dir}/{study_id}")
return False

ftp_filelist = ftp.nlst(".")
for ftp_filename in ftp_filelist:
if 'data_dict' in ftp_filename:
with open(f"{output_dir}/{study_id}/{ftp_filename}", "wb") as data_dict_file:
ftp.retrbinary(f"RETR {ftp_filename}", data_dict_file.write)

# Step 2: Check to see if there's a GapExchange file in the parent folder
# and if there is, get it.
ftp.cwd(f"/dbgap/studies/{study_variable}/{study_id}")
ftp_filelist = ftp.nlst(".")
for ftp_filename in ftp_filelist:
if 'GapExchange' in ftp_filename:
with open(f"{output_dir}/{study_id}/{ftp_filename}", "wb") as data_dict_file:
ftp.retrbinary(f"RETR {ftp_filename}", data_dict_file.write)
ftp.quit()
return True


def main():
# Delete any existing output dirs so you can ensure all datasets are fresh
#if os.path.isdir(output_dir):
# shutil.rmtree(output_dir)

# Make new output dir
os.makedirs(f"{output_dir}/", exist_ok=True)

# Parse input table and download all valid dbgap datasets to output
missing_data_dict_studies = {}
studies = {}

with open(input_file) as csv_file:
csv_reader = csv.DictReader(csv_file, delimiter="\t")
header = False
for row in csv_reader:
if not header:
# Check to make sure tsv contains column for Study Accession
if "Study Accession" not in row:
# Throw error if expected column is missing
raise IOError("Input file must contain 'Study Accession' column")
header = True
continue

# Get platform and make subdir if necessary
platform = row["Platform"].split(";")
platform = platform[0] if "BDC" not in platform else "BDC"

# Add any phs dbgap studies to queue of files to get
study_id = row["Study Accession"]
if study_id.startswith("phs") and study_id not in studies:
studies[study_id] = True
try:
# Try to download to output folder if the study hasn't already been downloaded
if not os.path.exists(f"{output_dir}/{platform}/{study_id}"):
print(f"Downloading: {study_id}")
if not download_dbgap_study(study_id, f"{output_dir}/{platform}"):
missing_data_dict_studies[study_id] = True

except Exception as e:
# If anything happens, delete the folder so we don't mistake it for success
shutil.rmtree(f"{output_dir}/{platform}/{study_id}")

# Count the number of subdirs currently in output_dir as the number downloaded
num_downloaded = len([path for path in os.walk(output_dir) if path[0] != output_dir])

# Get number of failed for missing data dicts
num_missing_data_dicts = len(list(missing_data_dict_studies.keys()))

# Total number of possible unique studies
num_possible = len(list(studies.keys()))

# Write out list of datasets with no data dicts
with open(f"{output_dir}/download_summary.txt", "w") as sum_file:
sum_file.write(f"Unique dbgap datasets in ncpi table: {num_possible}\n")
sum_file.write(f"Successfully Downloaded: {num_downloaded}\n")
sum_file.write(f"Total dbgap datasets missing data dicts: {num_missing_data_dicts}\n")
sum_file.write(f"Dbgap datasets missing data dicts:\n")
for item in missing_data_dict_studies:
sum_file.write(f"{item}\n")

print(f"Unique dbgap datasets in ncpi table: {num_possible}\n")
print(f"Successfully Downloaded: {num_downloaded}\n")
print(f"Total dbgap datasets missing data dicts: {num_missing_data_dicts}\n")


if __name__ == "__main__":
main()
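
A minimal way to exercise the helper above, assuming network access to ftp.ncbi.nlm.nih.gov and an output directory that does not already contain the study (the accession below is an illustrative example, not taken from the catalog file):

# Hypothetical smoke test for download_dbgap_study(); run from the bin/ directory so the relative paths resolve
ok = download_dbgap_study("phs000001.v3.p1", "../data/AnVIL")
print("data dictionaries downloaded" if ok else "no pheno_variable_summaries found for this study")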
Binary file added data/AnVIL.tar.gz (not shown)
Binary file added data/BDC.tar.gz (not shown)
Binary file added data/CRDC.tar.gz (not shown)
Binary file added data/KFDRC.tar.gz (not shown)
