The Open Data Platform for Sustainable, Accessible Lung Radiogenomics
Lung-SARG is a fully open-source and local-first platform that improves how communities collaborate on open data to diagnose lung cancer and perform epidemiology on local populations in low and middle income countries.
Tip
Datasets generated by this project are ready to explore and consume at HuggingFace.
- Open: Code, standards, infrastructure, and data, are public and open source.
- Modular and Interoperable: Each component can be replaced, extended, or removed. Works well in many environments (your laptop, in a cluster, or from the browser), can be deployed to many places (S3 + GH Pages, IPFS, ...) and integrates with multiple tools (thanks to the Arrow and Zarr ecosystems). Use open tools, standards, infrastructure, and share data in accessible formats.
- Data as Code: Declarative stateless transformations tracked in
git
. Improves data access and empowers data scientists to conduct research and helps to guide community-driven analysis and decisions. Version your data as code! Publish and share your reusable models for others to build on top. Datasets should be both reproducible and accessible! - Glue: Be a bridge between tools and approaches. E.g: Use software engineering good practices like types, tests, materialized views, and more.
- FAIR.
- KISS: Minimal and flexible. Rely on tools that do one thing and do it well.
- No vendor lock-in
- Rely on Open code, standards, and infrastructure.
- Use the tool you want to create, explore, and consume the datasets. Agnostic of any tooling or infrastructure provider.
- Standard format for data and APIs! Keep your data as future-friendly and future-proof as possible!
- Distributed: Permissionless ecosystem and collaboration. Open source code and make it ready to be improved.
- Community: that incentives contributors.
- Immutability: Embrace idempotency. Rely on content-addressable storage and append-only logs.
- Stateless and serverless: as much as possible. E.g. use GitHub Pages, host datasets on S3, interface with HTML, JavaScript, and WASM. No servers to maintain, no databases to manage, no infrastructure to worry about. Keep infrastructure management lean.
- Offline-first: Rely on static files and offline-first tools.
- Above all, have fun and enjoy the process 🎉
Lung SARG dataflow.
You can install all the dependencies inside a reproducible software environment via pixi. To do that, install pixi, clone the repository, and run the following command from the root folder.
pixi install -a
To see all tasks available:
pixi task list
Start and access the Dagster UI locally.
pixi run dev
In the Dagster UI, click
Overview -> Jobs -> stage_idc_nsclc_radiogenomic_samples -> Materialize all
Observe what happens in the Overview, Runs, and Assets pages of the Dagster UI, and the content in the lung-sarg/data directory.
This project started after thinking about what an Open Data Protocol could look like!
- This project was built on the principles espoused by David Gasquez at Datonic. It is built on the approach in the Datadex Open Data Platform and extended for scientific imaging data with OME-Zarr and the DICOM-based image data model in the NIH Imaging Data Commons.
- Lung-SARG is possible thanks to amazing open source projects like DuckDB, dbt, Dagster, ITK and many others...
- This project was built with support from Dr. James Gee in collaboration with the UPenn PICSL Lab.