📝 Sql query tutorial #642

Merged
chris-s-friedman merged 2 commits into master from sql_query_tutorial
Dec 9, 2021


chris-s-friedman
Contributor

@chris-s-friedman chris-s-friedman commented Dec 7, 2021

Add a tutorial on one method for sourcing data from a PostgreSQL database, and move the tutorial on sourcing from Study Creator into the same directory as the tutorial on sourcing data from SQL.

This is a potential solution for #627

🚚 move index for sourcing data from study creator

now it's in its own dir
@fiendish
Contributor

fiendish commented Dec 7, 2021

that's a very clever trick

@nicholasvk

Very cool! Just want to chime in with some history and background toward developing standards on when it makes sense to use SQL directly. We thought about doing this for the CBTN refresh work, which relies on data sources in the D3b warehouse. However, after discussing it as a team with Allison and Bailey, we decided to write SQL output to Data Tracker via the Study Creator API for audit purposes and then use the files on Data Tracker for the actual ingest. Database tables change over time as rows are added, deleted, or updated, so having a record of exactly what was ingested in a given run seemed important at the time. A bonus is that Data Tracker makes this history easily accessible to non-ADAPT team members. You can review the code (authored by Meen and Avi) for that here:
https://github.com/d3b-center/d3b-warehouse-kids-first-refresh/blob/main/etl_from_eig_into_warehouse/dumping.py

Is this the best workflow? Are there other solutions to explore for this? Absolutely, just wanted to add to the conversation with history on the CBTN side.
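For readers who want a concrete picture of the "dump first, ingest from the file" idea described above, here is a minimal sketch. It is not the code in the linked `dumping.py` and it does not call the Study Creator API; the function name `dump_query_to_csv`, the file naming scheme, and the use of a SHA-256 checksum are all assumptions made for illustration. It only relies on the standard Python DB-API cursor interface, so the same function should work with a PostgreSQL connection, although it is exercised here with the standard-library `sqlite3` module as a stand-in.

```python
import csv
import hashlib
import sqlite3  # stand-in driver; any DB-API connection should behave the same
from datetime import datetime, timezone
from pathlib import Path


def dump_query_to_csv(conn, query, out_dir, label):
    """Run `query` and write the result to a timestamped CSV snapshot.

    Hypothetical helper: returns the snapshot path and the SHA-256 of the
    file, so an ingest run can record exactly which data it consumed.
    """
    cursor = conn.cursor()
    cursor.execute(query)
    header = [col[0] for col in cursor.description]
    rows = cursor.fetchall()

    # A UTC timestamp in the file name keeps successive snapshots apart.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_path = Path(out_dir) / f"{label}_{stamp}.csv"
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

    # The checksum lets a later reviewer confirm the file was not altered.
    digest = hashlib.sha256(out_path.read_bytes()).hexdigest()
    return out_path, digest
```

The snapshot file, rather than the live table, would then be uploaded and used as the source for the ingest, which is what gives the audit trail described in the comment.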

@chris-s-friedman
Contributor Author

@nicholasvk I think those are really good points about

  1. being able to audit what data is used for ingest
  2. the volatility inherent to using SQL as a data source, i.e. rows being added, deleted, or updated
  3. making it easy for non-technical folks to have eyes on the data being used for ingest.

My thinking in proposing a tutorial on how to query SQL in an ingest is for two specific circumstances:

  1. Ingesting information about genomic files from AWS scrapes. Instead of having ingesters generate AWS manifests, we can pull directly from the file_metadata schema in Postgres.
  2. Providing a tutorial for external users who use the ingest library and don't have the Data Tracker, but do use databases as source data.

That said, I think I should preface this tutorial with a note about your three points and how querying directly from SQL impacts them. Perhaps even say "instead of querying SQL, you may want to create a static view of your database in a single file".
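As a rough illustration of the first circumstance (pulling file information straight from the database instead of a generated manifest), a direct query could be sketched as below. This is not the tutorial's code: the table name `aws_scrape`, its columns, and the function `read_genomic_files` are hypothetical, and `sqlite3` stands in for a Postgres driver so the sketch runs without a server. In PostgreSQL the table would presumably be qualified with the `file_metadata` schema.

```python
import sqlite3  # stand-in for a PostgreSQL driver such as psycopg2


def read_genomic_files(conn, bucket):
    """Return one dict per file row for `bucket`, keyed by column name.

    Hypothetical helper; table and column names are assumptions.
    """
    cursor = conn.cursor()
    # Parameterized query: the driver handles quoting of `bucket`.
    # sqlite3 uses "?" placeholders; psycopg2 uses "%s" instead.
    cursor.execute(
        "SELECT bucket, s3_key, size_bytes FROM aws_scrape "
        "WHERE bucket = ? ORDER BY s3_key",
        (bucket,),
    )
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]
```

Because the rows come from a live table, two runs of the same ingest can see different data, which is exactly the trade-off raised in the three points above.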

🚨 remove trailing whitespace

✏️ fix spelling and capitalization

Co-authored-by: Giovanni Santia <[email protected]>

✨ add note about considering not querying sql directly

thanks @nicholasvk for the suggestion

🚨 remove more trailing whitespace

@nicholasvk nicholasvk left a comment


Updated documentation for SQL considerations looks great!

@chris-s-friedman chris-s-friedman merged commit 92889ef into master Dec 9, 2021
@chris-s-friedman chris-s-friedman deleted the sql_query_tutorial branch December 9, 2021 00:04
chris-s-friedman added a commit that referenced this pull request Feb 23, 2022
## Release 1.11.0

### Summary

- Emojis: ? x3, ✨ x3
- Categories: Additions x3, Other Changes x3

### New features and changes

- [#645](#645) -  add gru-npu consent group - [1da6dd7](1da6dd7) by [chris-s-friedman](https://github.com/chris-s-friedman)
- [#644](#644) -  ✨ specify external IDs for clinical markers - [c1f6c1c](c1f6c1c) by [chris-s-friedman](https://github.com/chris-s-friedman)
- [#643](#643) - ✨ Add new sequencing center Tempus - [87f5cdf](87f5cdf) by [youngnm](https://github.com/youngnm)
- [#642](#642) -  Sql query tutorial - [92889ef](92889ef) by [chris-s-friedman](https://github.com/chris-s-friedman)
- [#641](#641) - ✨ Add NIH and Methylation constants - [6a1e70b](6a1e70b) by [youngnm](https://github.com/youngnm)
- [#639](#639) - ✨ add CSIR sequencing Center - [81e5752](81e5752) by [chris-s-friedman](https://github.com/chris-s-friedman)
@fiendish fiendish changed the title from "Sql query tutorial" to "📝 Sql query tutorial" Feb 23, 2022