This repo contains the information and instructions needed to download the dataset associated with the paper:
This data can be visualized in a public instance of the Digital Slide Archive at this link. If you click the “eye” image icon in the Annotations panel on the right side of the screen, you’ll see the results of a collaborative annotation.
- Each mask is a .png image, where pixel values encode region class membership. The meaning of each ground truth code can be found in the file ./meta/gtruth_codes.tsv.
- The name of each mask encodes all the information needed to extract the corresponding RGB images from TCGA slides. For convenience, the RGBs are also downloaded by the code provided here.
- [CRITICAL] - Please be aware that zero pixels represent regions outside the region of interest (“don’t care” class) and should be assigned zero-weight during model training; they do NOT represent an “other” class.
- The RGBs and corresponding masks will be at the set MPP resolution. If MPP is set to None, they will be at the MAG magnification. If both are set to None, they will be at the base (scan) magnification.
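The zero-weight rule for "don't care" pixels can be sketched with NumPy as follows. This is a minimal illustration, not the training code used in the paper: the helper names pixel_weights and masked_mean_loss, the toy arrays, and the label codes 1 and 2 are all assumptions made for the example. Only the meaning of code 0 comes from the note above.

```python
import numpy as np

DONT_CARE = 0  # zero pixels lie outside the region of interest


def pixel_weights(mask: np.ndarray) -> np.ndarray:
    """Return per-pixel loss weights: 0.0 where mask == 0, 1.0 elsewhere."""
    return (mask != DONT_CARE).astype(np.float32)


def masked_mean_loss(per_pixel_loss: np.ndarray, mask: np.ndarray) -> float:
    """Average a per-pixel loss over annotated (non-zero) pixels only."""
    weights = pixel_weights(mask)
    total = float(weights.sum())
    if total == 0.0:
        return 0.0  # no annotated pixels, so this tile contributes nothing
    return float((per_pixel_loss * weights).sum() / total)


# Toy 2x3 mask: codes 1 and 2 are illustrative labels, 0 is "don't care".
toy_mask = np.array([[0, 1, 1],
                     [2, 0, 2]], dtype=np.uint8)
# Hypothetical per-pixel loss values; the large values sit on "don't care" pixels.
toy_loss = np.array([[9.0, 1.0, 3.0],
                     [2.0, 9.0, 2.0]], dtype=np.float32)

print(pixel_weights(toy_mask))
print(masked_mean_loss(toy_loss, toy_mask))
```

The same weight array can be passed to most training frameworks as a per-pixel sample weight, so that zero pixels affect neither the loss nor the reported accuracy.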
To reproduce the accuracy results presented in the paper, please be sure to read the paper methodology, including the supplementary methods section. Some details are also reiterated here for convenience (including train-test splits and class grouping). This GitHub repository contains the code we used, and the trained VGG16-FCN8 TensorFlow model weights can be downloaded at this link.
You can use this link to download the dataset at 0.25 MPP resolution.
Use this to download all elements of the dataset using the command line.
This script will download any or all of the following:
- annotation JSON files (coordinates relative to WSI base resolution)
- masks
- RGB images
Steps are as follows:
Step 0: Clone this repo
$ git clone https://github.com/CancerDataScience/CrowdsourcingDataset-Amgadetal2019
$ cd CrowdsourcingDataset-Amgadetal2019
Step 1: Install requirements
pip install girder-client pillow numpy scikit-image imageio
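After installing, a quick check like the one below can confirm that the packages are importable before running the download script. It is only a convenience sketch: the helper missing_modules and the list REQUIRED_MODULES are hypothetical names, not part of this repo. Note that some import names differ from the pip package names.

```python
import importlib.util
from typing import Iterable, List

# Import names differ from some pip package names:
# pillow -> PIL, scikit-image -> skimage, girder-client -> girder_client.
REQUIRED_MODULES = ["girder_client", "PIL", "numpy", "skimage", "imageio"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```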
Step 2 (optional): Edit configs.py
If you like, you may edit various download configurations. Of note:
- SAVEPATH - where everything will be saved
- MPP - microns per pixel for RGBs and masks (preferred, default is 0.25)
- MAG - magnification (overridden by MPP if MPP is set; default is None)
- PIPELINE - which elements to download
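The precedence between MPP and MAG described above can be summarized in a few lines. This is an illustrative sketch only; describe_resolution is a hypothetical helper and does not reflect how the download script is implemented internally.

```python
from typing import Optional


def describe_resolution(mpp: Optional[float], mag: Optional[float]) -> str:
    """Apply the documented precedence: MPP first, then MAG, then base."""
    if mpp is not None:
        return f"{mpp} microns per pixel"
    if mag is not None:
        return f"{mag}x magnification"
    return "base (scan) magnification"


print(describe_resolution(0.25, None))  # the default settings
print(describe_resolution(0.25, 10.0))  # MPP overrides MAG
print(describe_resolution(None, 10.0))  # MAG is used only when MPP is None
print(describe_resolution(None, None))  # neither set: base magnification
```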
Step 3: Run the download script
python download_crowdsource_dataset.py
The script will create the following sub-directories in SAVEPATH:
|_ annotations : where JSON annotations will be saved for each slide
|_ masks : where the ground truth masks to use for training and validation are saved
|_ images: where RGB images corresponding to masks are saved
|_ wsis (legacy) : Ignore this. No longer supported.
|_ logs : in case anything goes wrong
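Once the download finishes, masks and RGB images can be matched up for training. The sketch below assumes that a mask and its RGB image share the same file name apart from the extension; that is an assumption, so please check it against your own download. The function pair_masks_with_images and the path ./data are hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def pair_masks_with_images(savepath: str) -> List[Tuple[Path, Path]]:
    """Return (mask, image) path pairs that share a file stem.

    Assumption: a mask and its RGB image have the same name apart from
    the extension. Masks without a matching image are skipped.
    """
    root = Path(savepath)
    images = {p.stem: p for p in (root / "images").glob("*") if p.is_file()}
    pairs = []
    for mask_path in sorted((root / "masks").glob("*.png")):
        image_path = images.get(mask_path.stem)
        if image_path is not None:
            pairs.append((mask_path, image_path))
    return pairs


if __name__ == "__main__":
    # "./data" stands in for whatever SAVEPATH was set to in configs.py.
    for mask_path, image_path in pair_masks_with_images("./data"):
        print(mask_path.name, "<->", image_path.name)
```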
The dataset itself is licensed under a CC0 1.0 Universal (CC0 1.0) license. If you use the data, we would appreciate it if you cite our paper.
This codebase is licensed under the MIT license.