Pipeline parameterize restructure #95

Merged
merged 197 commits into from
Apr 18, 2024
Commits
290ca53
roger cli prepped for Merge Deploy
Mar 23, 2023
1e46a36
Update Makefile to work with python env
Mar 29, 2023
7baf7ef
Update redisgraph-bulk-loader to fix issue with loading MODULE LIST
Mar 29, 2023
0f5e35e
Revert "Update redisgraph-bulk-loader to fix issue with loading MODUL…
Mar 29, 2023
ee17823
Finalized dev deployment of dug inside Catapult Merge, deployment yam…
Mar 31, 2023
135d54b
updated to reflect the Dug-Api updates to FastAPI
Apr 17, 2023
bb148a4
adding multi label redis by removing 'biolink:' on nodes, edges canno…
May 5, 2023
b6ea2c7
Working multi label redis nodes w/ no biolink label
braswent May 24, 2023
5b85480
Latest code changes to deploy working Roger in Merge
braswent May 25, 2023
0083b60
biolink data move to '.' separator
braswent Jun 9, 2023
c7b3a71
updates to include new dug fixes, upgraded redis-bulk-loader and made…
braswent Jun 20, 2023
50f5b23
adding test roger code
braswent Jun 21, 2023
444fd16
removed helm deployments
braswent Aug 9, 2023
ce54b73
Merge remote-tracking branch 'rti/develop' into rti-merge
YaphetKG Sep 29, 2023
28fb94b
change docker owner
YaphetKG Sep 29, 2023
142e874
remove core.py
YaphetKG Sep 29, 2023
3b4ed65
remove dup dev config
YaphetKG Sep 29, 2023
d6a779a
redis graph is not directly used removing cruft
YaphetKG Sep 29, 2023
1df8ed5
remove print statement
YaphetKG Sep 29, 2023
e608fa1
remove logging files
YaphetKG Sep 29, 2023
d2cee55
update requirements
YaphetKG Sep 29, 2023
89ba673
update requirements
YaphetKG Sep 29, 2023
82161c3
add redis graph.py
YaphetKG Oct 2, 2023
8ef6b49
fix import error for logger
YaphetKG Oct 2, 2023
d03acb0
adding es scheme and ca_path config
YaphetKG Oct 2, 2023
16e3d3a
adding es scheme and ca_path config
YaphetKG Oct 2, 2023
a84e926
Parameterized annotate tasks with input_data_path and output_data_path
mbacon-renci Oct 2, 2023
0beba39
adding debug code
YaphetKG Oct 3, 2023
5725045
removing debug
YaphetKG Oct 3, 2023
ae8b0fd
adding nodes args
YaphetKG Oct 3, 2023
a9e422f
adding biolink.
YaphetKG Oct 3, 2023
5c37b91
adding biolink.
YaphetKG Oct 3, 2023
662d153
Parameterized annotate tasks with input_data_path and output_data_pat…
mbacon-renci Oct 10, 2023
1c212dc
adding lakefs changes to roger-2.0
YaphetKG Oct 10, 2023
f7feb78
point avalon to vg1 branch
YaphetKG Oct 10, 2023
e335ce4
change avalon dep
YaphetKG Oct 10, 2023
42ab8d1
update airflow
YaphetKG Oct 11, 2023
148c3f6
fix avalon tag typo
YaphetKG Oct 11, 2023
f9157ac
update jenkins to tag version on main branch only
YaphetKG Oct 11, 2023
86ad101
update jenkins to tag version
YaphetKG Oct 11, 2023
793e396
update jenkins to tag version
YaphetKG Oct 11, 2023
044e4f1
psycopg2 installation
YaphetKG Oct 11, 2023
0996b16
add cncf k8s req
YaphetKG Oct 11, 2023
30476cd
use airflow non-slim
YaphetKG Oct 11, 2023
1727c56
simplified for testing
YaphetKG Oct 11, 2023
dee9b68
simplified for testing
YaphetKG Oct 11, 2023
1696564
change dag name
YaphetKG Oct 11, 2023
b749316
Erroneous parameter passed, should not be None
mbacon-renci Oct 11, 2023
ab5307b
Merge branch 'develop' into patch/input_output_path_params
mbacon-renci Oct 11, 2023
c3a4aa2
Catching up to roger-2.0 build
mbacon-renci Oct 11, 2023
cda7389
adding pre-exec
YaphetKG Oct 12, 2023
322d888
adding pre-exec
YaphetKG Oct 12, 2023
0922c5a
adding pre-exec
YaphetKG Oct 12, 2023
ed3008a
typo preexec
YaphetKG Oct 12, 2023
0155f3e
typo preexec
YaphetKG Oct 12, 2023
d3c7c24
fix context
YaphetKG Oct 12, 2023
a190c65
get files from repo
YaphetKG Oct 17, 2023
ebc49ed
get files from repo
YaphetKG Oct 17, 2023
3716179
get files from repo
YaphetKG Oct 17, 2023
81f69a4
get files from repo
YaphetKG Oct 17, 2023
84ceaf7
First shot at moving pipeline into base class and implementing. Anvil…
mbacon-renci Oct 17, 2023
7ee69fe
Syntax fix, docker image version bump to airflow 2.7.2-python3.11
mbacon-renci Oct 17, 2023
b13c4c0
update storage dir
YaphetKG Oct 17, 2023
1ca6ead
update remove dir code
YaphetKG Oct 17, 2023
694cc05
update remove dir code
YaphetKG Oct 17, 2023
3341ef8
remote path to *
YaphetKG Oct 17, 2023
80793bb
fix input dir for annotators
YaphetKG Oct 17, 2023
7233742
fix input dir for annotators
YaphetKG Oct 17, 2023
4984dfb
fix input dir for annotators
YaphetKG Oct 17, 2023
bff30a9
kwargs to task
YaphetKG Oct 17, 2023
58be140
kwargs to task
YaphetKG Oct 17, 2023
d107d53
kwargs to task
YaphetKG Oct 17, 2023
a56d239
kwargs to task
YaphetKG Oct 17, 2023
ecfd4e2
kwargs to task
YaphetKG Oct 18, 2023
08a9ac4
kwargs to task
YaphetKG Oct 18, 2023
92bdb18
kwargs to task
YaphetKG Oct 18, 2023
00dc858
kwargs to task
YaphetKG Oct 18, 2023
3751ed4
kwargs to task
YaphetKG Oct 18, 2023
eb473a7
kwargs to task
YaphetKG Oct 18, 2023
6cc7984
kwargs to task
YaphetKG Oct 18, 2023
3825dbb
adding branch info on lakefs config
YaphetKG Oct 18, 2023
93ad480
callback push to branch
YaphetKG Oct 18, 2023
bef2d3f
back to relative import
YaphetKG Oct 18, 2023
802c499
reformat temp branch name based on unique task id
YaphetKG Oct 18, 2023
cabecb4
add logging
YaphetKG Oct 18, 2023
29a7d65
add logging
YaphetKG Oct 18, 2023
5211e6f
convert posix path to str for avalon
YaphetKG Oct 18, 2023
585674d
add extra / to root path
YaphetKG Oct 18, 2023
e95303d
New dag created using DugPipeline subclasses
mbacon-renci Oct 20, 2023
c105af6
EmptyOperator imported from wrong place
mbacon-renci Oct 20, 2023
fd1aa05
import and syntax fixes
mbacon-renci Oct 20, 2023
55e9970
utterly silly syntax error
mbacon-renci Oct 21, 2023
61afe68
Added anvil to default input data sets for testing purposes
mbacon-renci Oct 21, 2023
7689203
adding / to local path
YaphetKG Oct 23, 2023
4357fbe
commit meta task args empty string
YaphetKG Oct 23, 2023
d985d58
add merge logic
YaphetKG Oct 23, 2023
04bb9dc
add merge logic
YaphetKG Oct 23, 2023
79d5936
upstream task dir pull for downstream task
YaphetKG Oct 23, 2023
add4ce3
Switched from subdag to taskgroup because latest Airflow deprecated s…
mbacon-renci Oct 23, 2023
9116311
Added BACPAC pipeline object
mbacon-renci Oct 23, 2023
9f92ec0
Temporarily ignoring configuration variable for enabled datasets for …
mbacon-renci Oct 23, 2023
813de3d
Passed dag in to create task group to see if it helps dag errors
mbacon-renci Oct 23, 2023
af79f5d
Fixed silly syntax error
mbacon-renci Oct 23, 2023
6bffb12
adding input / output dir params for make kgx
YaphetKG Oct 23, 2023
d39371b
Trying different syntax to make taskgroups work.
mbacon-renci Oct 23, 2023
c3e57ea
adding input / output dir params for make kgx
YaphetKG Oct 23, 2023
738c3f0
Parsing, syntax, pylint fixes
mbacon-renci Oct 23, 2023
c8ca1ed
adding input / output dir params for make kgx
YaphetKG Oct 23, 2023
e3bd1c8
Added pipeline name to task group name to ensure uniqueness
mbacon-renci Oct 24, 2023
764a74a
oops, moved something out of scope. Fixed
mbacon-renci Oct 24, 2023
583178a
Filled out pipeline with methods from dug_utils. Needs data path changes
mbacon-renci Oct 26, 2023
f56a44f
Finished implementing input_data_path and output_data_path handling, …
mbacon-renci Nov 3, 2023
64711fc
Merged lakefs-2.0 changes into restructure branch
mbacon-renci Nov 3, 2023
dfe99e3
Update requirements.txt
YaphetKG Dec 11, 2023
7522a10
adding toggle to avoid sending config obj
YaphetKG Dec 11, 2023
a9eecd7
adding toggle to avoid sending config obj
YaphetKG Dec 11, 2023
05df5b1
disable to string for test
YaphetKG Dec 12, 2023
08ab1f0
control pipelines for testing
YaphetKG Dec 12, 2023
9819cf6
add self to anvil get files
YaphetKG Dec 12, 2023
95490ae
add log stream to make it available
YaphetKG Dec 12, 2023
c3fc968
typo fix
YaphetKG Dec 12, 2023
e66e446
correcting branch id
YaphetKG Dec 13, 2023
7c1cb91
adding source repo
YaphetKG Dec 13, 2023
ccbee52
adding source repo
YaphetKG Dec 13, 2023
ede3ba2
patch name-resolver response
YaphetKG Dec 13, 2023
f610c79
no pass input repo and branch, if not overridden to pre-exec
YaphetKG Dec 14, 2023
5588623
no pass input repo and branch, if not overridden to pre-exec
YaphetKG Dec 14, 2023
a19acbf
no pass input repo and branch, if not overridden to pre-exec
YaphetKG Dec 14, 2023
1cf283f
dug pipeline edit
YaphetKG Dec 14, 2023
65affe6
find recursively
YaphetKG Dec 14, 2023
b197d3d
find recursively
YaphetKG Dec 14, 2023
8864f6a
setup output path for crawling
YaphetKG Dec 14, 2023
2c01862
all task functions should have input and output params
YaphetKG Dec 14, 2023
f09dc04
adding annotation as upstream for validate index
YaphetKG Dec 14, 2023
cb5af24
revamp create task , and task wrapper
YaphetKG Dec 15, 2023
ba3809b
add validate concepts index task
YaphetKG Dec 15, 2023
4dc9ed0
adding concept validation
YaphetKG Dec 18, 2023
0b7d0c4
add index_variables task as dependency for validate concepts
YaphetKG Dec 18, 2023
34505c7
add index_variables task as dependency for validate concepts
YaphetKG Dec 18, 2023
08cf965
await client exist
YaphetKG Dec 19, 2023
5dd29bd
await client exist
YaphetKG Dec 19, 2023
eaf238d
concepts not getting picked up for indexing
YaphetKG Dec 19, 2023
c6f89e4
concepts not getting picked up for indexing
YaphetKG Dec 19, 2023
8b4aa83
Merge branch 'develop' into pipeline_parameterize_restructure
YaphetKG Dec 19, 2023
ca06e0b
fix search elements
YaphetKG Dec 19, 2023
b73602e
converting annotation output to json
YaphetKG Dec 19, 2023
ef6ed63
json format annotation outputs
YaphetKG Dec 19, 2023
c53e377
adding support for json format elements and concepts read
YaphetKG Dec 19, 2023
31eeb45
json back to dug objects
YaphetKG Dec 19, 2023
cda0d6d
fixing index variables with json objects
YaphetKG Dec 19, 2023
4bd0ee5
indentation and new line for better change detection :?
YaphetKG Dec 19, 2023
ce24e45
indentation and new line for better change detection
YaphetKG Dec 19, 2023
af4001a
treat dictionary concepts as dictionary
YaphetKG Dec 19, 2023
ada224d
read concepts json as a dict
YaphetKG Dec 19, 2023
b6fa0bb
concepts files are actually file paths
YaphetKG Dec 19, 2023
bbe946c
debug message
YaphetKG Dec 19, 2023
9c72c1c
make output jsonable
YaphetKG Dec 19, 2023
9b636cd
clear up dir after commit , and delete unmerged branch even if no cha…
YaphetKG Dec 20, 2023
6891596
don't clear indexes, parallel dataset processing will be taxed
YaphetKG Dec 20, 2023
6f0f4a5
memory leak?
YaphetKG Dec 20, 2023
c0250b5
memory leak?
YaphetKG Dec 20, 2023
0b2e103
memory leak?
YaphetKG Dec 20, 2023
0cc5eeb
dumping pickles to debug locally
YaphetKG Dec 20, 2023
c7e3f96
find out why concepts are being added to every other element
YaphetKG Dec 21, 2023
de3fd93
find out why concepts are being added to every other element
YaphetKG Dec 21, 2023
f7d0bd8
pointless shuffle 🤷‍♂️
YaphetKG Dec 21, 2023
3982b12
revert back in time
YaphetKG Dec 21, 2023
6370afd
back to sanitize dug
YaphetKG Dec 21, 2023
c1821e5
output just json for annotation
YaphetKG Dec 21, 2023
18f3d29
adding jsonpickle
YaphetKG Dec 21, 2023
87adfe0
jsonpickle 🥒
YaphetKG Dec 21, 2023
daaca7a
unpickle for index
YaphetKG Dec 21, 2023
905f151
unpickle for validate index
YaphetKG Dec 21, 2023
35daa88
crawling fixes
YaphetKG Dec 21, 2023
8d0ca0d
crawling fixes
YaphetKG Dec 21, 2023
d246476
crawling validation fixes
YaphetKG Dec 21, 2023
8850681
fix index concepts
YaphetKG Dec 21, 2023
abaa33a
fix makekgx
YaphetKG Dec 21, 2023
ba2a7d8
adding other bdc pipelines
YaphetKG Dec 21, 2023
da266d6
adding pipeline parameters to be able to configure per instance
YaphetKG Dec 21, 2023
6a2e69c
fix
YaphetKG Dec 21, 2023
592a32f
add input dataset for pipelines
YaphetKG Dec 21, 2023
7588a66
Adding README to document how to create data set-specific pipelines
mbacon-renci Dec 21, 2023
7f8f482
catchup on base.py
mbacon-renci Dec 21, 2023
84cd683
Diverge/catchup merge to add README.md
mbacon-renci Dec 21, 2023
f133b86
Added dbgap and nida pipelines
mbacon-renci Dec 21, 2023
705b3db
fix import errors
YaphetKG Dec 21, 2023
f79350a
Merge branch 'develop' into pipeline_parameterize_restructure
YaphetKG Dec 22, 2023
21e4271
annotator modules added by passing config val (#90)
braswent Jan 29, 2024
d3e6c53
Add heal parsers (#96)
YaphetKG Feb 8, 2024
45d1182
Add heal parsers (#97)
YaphetKG Feb 13, 2024
8dfeeb7
Radx pipeline (#99)
YaphetKG Mar 12, 2024
6852ca3
adding indexes as part of bulk loader parameters
YaphetKG Mar 12, 2024
38b8d01
fix id index cli arg
YaphetKG Mar 13, 2024
08fb4fd
fix local cli
YaphetKG Apr 18, 2024
5209aed
dug latest
YaphetKG Apr 18, 2024
9e900b8
Merge branch 'develop' into pipeline_parameterize_restructure
YaphetKG Apr 18, 2024
6 changes: 3 additions & 3 deletions .env
Original file line number Diff line number Diff line change
@@ -9,9 +9,9 @@ DATA_DIR=./local_storage

DUG_LOG_LEVEL=INFO

-ELASTIC_PASSWORD=12345
-ELASTIC_API_HOST=elasticsearch
-ELASTIC_USERNAME=elastic
+ELASTICSEARCH_PASSWORD=12345
+ELASTICSEARCH_HOST=elasticsearch
+ELASTICSEARCH_USERNAME=elastic

NBOOST_API_HOST=nboost
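
The hunk above renames the three Elasticsearch settings from `ELASTIC_*` to `ELASTICSEARCH_*` (with `ELASTIC_API_HOST` becoming `ELASTICSEARCH_HOST`). As a rough sketch of how calling code could tolerate both spellings during a transition, the hypothetical helper below looks up the new name first and falls back to the old one. `LEGACY_NAMES` and `read_setting` are illustrative names that do not appear in this PR, and nothing here implies that Roger or Dug performs such a fallback.

```python
import os

# Hypothetical mapping from the new variable names to the names they replace.
# The pairs mirror the .env hunk; the mapping itself is an assumption.
LEGACY_NAMES = {
    "ELASTICSEARCH_PASSWORD": "ELASTIC_PASSWORD",
    "ELASTICSEARCH_HOST": "ELASTIC_API_HOST",
    "ELASTICSEARCH_USERNAME": "ELASTIC_USERNAME",
}


def read_setting(name, env=None, default=None):
    """Return the value for `name`, falling back to its legacy name if set."""
    env = os.environ if env is None else env
    if name in env:
        return env[name]
    legacy = LEGACY_NAMES.get(name)
    if legacy is not None and legacy in env:
        return env[legacy]
    return default


# Usage: an environment that still uses the old host variable.
old_style_env = {"ELASTIC_API_HOST": "elasticsearch"}
host = read_setting("ELASTICSEARCH_HOST", env=old_style_env)
```

Passing the environment as a plain mapping keeps the helper easy to exercise without mutating `os.environ`.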

2 changes: 2 additions & 0 deletions .gitignore
@@ -1,4 +1,6 @@
# Git ignore boilerplate from https://github.com/github/gitignore/blob/master/Python.gitignore
+.secret-env
+.vscode/

# Byte-compiled / optimized / DLL files
__pycache__/
10 changes: 4 additions & 6 deletions Dockerfile
@@ -1,11 +1,9 @@
-FROM apache/airflow:2.5.0-python3.10
+FROM apache/airflow:2.7.2-python3.11

USER root
RUN apt-get update && \
-    apt-get install -y git gcc python3-dev nano vim
+    apt-get install -y git nano vim
COPY requirements.txt requirements.txt
USER airflow
-# dependency resolution taking hours eventually failing,
-# @TODO fix click lib dependency
-RUN pip install -r requirements.txt && \
-    pip uninstall -y elasticsearch-dsl
+RUN pip install -r requirements.txt
+RUN rm -f requirements.txt
32 changes: 8 additions & 24 deletions Jenkinsfile
@@ -70,31 +70,15 @@ spec:
steps {
script {
container(name: 'kaniko', shell: '/busybox/sh') {
-kaniko.buildAndPush("./Dockerfile", ["$IMAGE_NAME:$TAG1", "$IMAGE_NAME:$TAG2", "$IMAGE_NAME:$TAG3", "$IMAGE_NAME:$TAG4"])
+if (env.BRANCH_NAME == "main") {
+// Tag with latest and version only when pushed to main
+kaniko.buildAndPush("./Dockerfile", ["$IMAGE_NAME:$TAG1", "$IMAGE_NAME:$TAG2", "$IMAGE_NAME:$TAG3", "$IMAGE_NAME:$TAG4"])
+} else {
+kaniko.buildAndPush("./Dockerfile", ["$IMAGE_NAME:$TAG1", "$IMAGE_NAME:$TAG2"])
+}
}
}
}
-// post {
-// always {
-// archiveArtifacts artifacts: 'image.tar', onlyIfSuccessful: true
-// }
-// }
}
-// stage('Publish') {
-// steps {
-// script {
-// container(name: 'crane', shell: '/busybox/sh') {
-// def imageTagsPushAlways = ["$IMAGE_NAME:$TAG1", "$IMAGE_NAME:$TAG2"]
-// def imageTagsPushForDevelopBranch = ["$IMAGE_NAME:$TAG3"]
-// def imageTagsPushForMasterBranch = ["$IMAGE_NAME:$TAG3", "$IMAGE_NAME:$TAG4"]
-// image.publish(
-// imageTagsPushAlways,
-// imageTagsPushForDevelopBranch,
-// imageTagsPushForMasterBranch
-// )
-// }
-// }
-// }
-// }
-}
-}
+}
+}
103 changes: 0 additions & 103 deletions dags/annotate.py

This file was deleted.

44 changes: 44 additions & 0 deletions dags/annotate_and_index.py
@@ -0,0 +1,44 @@
"""DAG which performs Dug annotate and index operations.

This DAG differs slightly from prior versions of the same functionality in
Roger: annotation and indexing now happen in the same DAG, and the tasks are
broken out into task groups organized by dataset. Each dataset has one task
group containing all of its tasks.
"""

import os

from airflow.models import DAG
from airflow.operators.empty import EmptyOperator
from roger.tasks import default_args, create_pipeline_taskgroup

env_enabled_datasets = os.getenv(
    "ROGER_DUG__INPUTS_DATA__SETS", "topmed,anvil").split(",")

with DAG(
        dag_id='annotate_and_index',
        default_args=default_args,
        schedule_interval=None
) as dag:
    init = EmptyOperator(task_id="init", dag=dag)
    finish = EmptyOperator(task_id="finish", dag=dag)

    from roger import pipelines
    from roger.config import config
    envspec = os.getenv("ROGER_DUG__INPUTS_DATA__SETS","topmed:v2.0")
    data_sets = envspec.split(",")
    pipeline_names = {x.split(':')[0]: x.split(':')[1] for x in data_sets}
    for pipeline_class in pipelines.get_pipeline_classes(pipeline_names):
        # Only use pipeline classes that are in the enabled datasets list and
        # that have a properly defined pipeline_name attribute

        # TODO
        # Overriding environment variable just to see if this is working.
        # name = getattr(pipeline_class, 'pipeline_name', '*not defined*')
        # if not name in env_enabled_datasets:
        #     continue

        # Do the thing to add the pipeline's subdag to the dag in the right way
        # . . .

        init >> create_pipeline_taskgroup(dag, pipeline_class, config) >> finish
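
The `pipeline_names` comprehension in this file expects every entry of `ROGER_DUG__INPUTS_DATA__SETS` to have the form `name:version`. An entry with no colon, such as those in the `topmed,anvil` default used for `env_enabled_datasets`, would raise `IndexError` on `x.split(':')[1]`. A more forgiving parser might look like the sketch below. The function name `parse_dataset_spec` and the `"latest"` fallback version are assumptions made for illustration and are not part of this PR.

```python
def parse_dataset_spec(spec, default_version="latest"):
    """Parse 'name:version,name:version' into a {name: version} dict.

    Entries without a version (or with an empty one) get `default_version`,
    which is a hypothetical placeholder rather than a value Roger defines.
    """
    result = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            # Skip empty entries caused by stray or trailing commas.
            continue
        name, sep, version = entry.partition(":")
        version = version.strip()
        result[name.strip()] = version if sep and version else default_version
    return result


# Usage with the two default strings that appear in the DAG file.
versioned = parse_dataset_spec("topmed:v2.0")
unversioned = parse_dataset_spec("topmed,anvil")
```

`str.partition` always returns three parts, so the parser never indexes past the end of a split result, whichever form an entry takes.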