Update README.md #34

Merged 1 commit on Feb 14, 2024
209 changes: 208 additions & 1 deletion path-foundation/README.md
# Path Foundation

Path Foundation is a tool that transforms pathology imaging into a machine
learning representation: embeddings. An embedding is a list of floating-point
values that represents a projection of the original image into a compressed
feature space. These embeddings can be used to develop custom machine learning
models for pathology use cases with less data and compute than traditional
model development methods require.

You can read more about the research and underlying model in our recent
publication
[Domain-specific optimization and diverse evaluation of self-supervised models for histopathology](https://arxiv.org/abs/2310.13259).

## How to use the Path Foundation API

1. [Decide how to gain access](#how-to-gain-access).

1. With the individual or group email identity at hand from the previous step, fill out the [API access form](http://bit.ly/fm-path-access-form).

1. Once approved for non-clinical use, the Google identity (or identities) you provided will gain access to the API and test data. You will be notified via the provided email address and can then start using the API.

1. Once notified, the approved identities can use [the demo Colab](https://github.com/Google-Health/imaging-research/blob/master/path-foundation/linear-classifier-demo.ipynb) to train a sample linear classifier. You can experiment with [our sample digitized pathology images & training labels](#use-our-test-data) to understand the API, then modify the Colab to use [your own data](#use-your-own-data). The Colab includes tutorials for the following (a sketch of the label-generation step follows this list):

* Generating training labels in JSON format from masks in PNG format.
* Generating a temporary access token for the API to read the DICOM images from a [Cloud DICOM Store](https://cloud.google.com/healthcare-api/docs/concepts/dicom) on behalf of the person running the Colab.
* Calling the API to train a linear classifier using images from a Cloud DICOM store and training labels from a [Cloud Storage (GCS) bucket](https://cloud.google.com/storage).

1. [Contact us](#contact) if training your custom model turns out to be more involved or to require more advanced batching. We're happy to help!
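As a concrete illustration of the first tutorial step above, the sketch below derives patch-level JSON labels from a PNG mask. The patch size and the JSON field names (`patch_x`, `patch_y`, `label`) are hypothetical stand-ins, not the schema the demo Colab actually expects; adapt them to the Colab's format.

    # Hypothetical sketch: turn a binary PNG mask into patch-level JSON labels.
    # The field names below are illustrative, not the Colab's actual schema.
    import json

    import numpy as np
    from PIL import Image

    PATCH_SIZE = 224  # assumed patch size; match whatever your pipeline uses

    mask = np.array(Image.open("tumor_mask.png").convert("L")) > 0
    labels = []
    for y in range(0, mask.shape[0] - PATCH_SIZE + 1, PATCH_SIZE):
        for x in range(0, mask.shape[1] - PATCH_SIZE + 1, PATCH_SIZE):
            patch = mask[y:y + PATCH_SIZE, x:x + PATCH_SIZE]
            # Call a patch positive when most of its pixels fall inside the mask.
            labels.append({"patch_x": x, "patch_y": y, "label": int(patch.mean() > 0.5)})

    with open("training_labels.json", "w") as f:
        json.dump(labels, f)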

## How to gain access
You have the option to request access to the API either as [an individual](#as-an-individual) or for [a group](#as-a-group-recommended). Choose the process that best aligns with your needs. Remember to note the email identifier for which you will be requesting access. It should be in one of these formats:

* YOUR-GROUP-NAME@YOUR-DOMAIN
* INDIVIDUAL-ID@YOUR-DOMAIN
* [email protected] (not recommended for more involved research projects at large organizations)

### As a group (recommended)
If your organization is a Google Workspace or Google Cloud Platform (GCP) customer, contact your Google admin and ask them to create a group with the list of individuals who will be using the API. Let them know that this group is used for contacting you and also as a security principal for authorizing your access to the API.

![Create Google Group](img/create-group.png)

Otherwise, [create a free Cloud Identity Account](https://cloud.google.com/identity/docs/set-up-cloud-identity-admin) for your domain name and in the process become the interim Google admin for your organization. Visit [Google Admin Console](https://admin.google.com/) and create the above-mentioned group. If your individual identities are unknown to Google, they will need to follow the process for the [individuals](#as-an-individual) before you can add them to the group.

### As an individual

If your organization is a Google Workspace or GCP customer, identity federation is most likely set up between your corporate identity directory and [Google Identity and Access Management](https://cloud.google.com/security/products/iam) and therefore individuals already have Google identities in the form of their corporate emails. Check with your IT department to find out whether identity federation is already in place or will be established soon.

Otherwise, [create a Google identity based on your email](https://accounts.google.com/signup/v2/webcreateaccount?flowName=GlifWebSignIn&flowEntry=SignUp). Opt for the "use my current email address instead" option, as shown in the screen capture below.

IMPORTANT: You should choose a password that is different from your corporate password.

![Create Google Id](img/create-identity.png)

NOTE: If you want to sign up as an individual with a Gmail account, you don't need to create a Google identity and can skip the step above.

## Use our test data

Upon gaining access to the API, you'll also have access to publicly available data we've curated specifically for testing. This is to help you get started with your initial experiments. The default state of the demo Colab is set to use this test data, i.e. DICOM images stored in a Cloud DICOM Store and training labels in PNG and JSON formats in a GCS bucket. As you become more familiar with the demo Colab, you have the option to modify it to [work with your data](#use-your-own-data).

## Use your own data

To use your own data with the API, you will need the following GCP resources:

* A [GCP Project](https://cloud.google.com/storage/docs/projects)
* A Cloud DICOM Store in the project for storing digitized pathology images
* A GCS bucket in the project for storing data in file format (i.e. training labels, embeddings, and DICOM files)

WARNING: You hold responsibility for the data that you use with the API. It's important to comply with all the applicable regulations and policies that govern the use of your data.

WARNING: If your organization is already a GCP user, ensure that you follow approved methods for storing data and granting access to [your chosen identity](#how-to-gain-access), in line with your organization's data privacy and security policies. The instructions in this section should only be used if your organization's policies permit experimenting with de-identified data in ungoverned storage systems.

WARNING: While the API can read data from any [DICOMweb-compliant](https://www.dicomstandard.org/using/dicomweb) storage system, Google Cloud DICOM Store is optimized for the scale and latency required for handling [digitized pathology images](https://cloud.google.com/healthcare-api/docs/how-tos/dicom-digital-pathology). We cannot guarantee the same performance or functionality with other storage systems.

NOTE: This guide assumes you can create GCP resources interactively through the [Google Cloud Console](https://console.cloud.google.com). If your organization's policies prevent that, please contact us for assistance in presenting the resource requirements as infrastructure-as-code.

NOTE: The demo Colab demonstrates how to call the API using short-lived access tokens. These tokens permit the API to read and process the images on behalf of the individual who is running the Colab. It's important to note that the API cannot access your data independently. The API processes images when you instruct it to using a time-limited access token and does not store the images after processing.
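For illustration, such a short-lived token can be obtained in Python with the google-auth library, assuming Application Default Credentials are already configured (see the `gcloud auth application-default login` step below):

    # Sketch: obtain a short-lived OAuth access token for Google Cloud APIs.
    # Assumes Application Default Credentials are configured, e.g. via
    # `gcloud auth application-default login`.
    import google.auth
    from google.auth.transport.requests import Request

    credentials, project_id = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    credentials.refresh(Request())   # fetches a fresh, time-limited token
    print(credentials.token)         # sent as a Bearer token when calling the API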


1. If you don't have access to an existing GCP Project, you need to [create one](https://cloud.google.com/free).

1. Follow [these instructions](https://cloud.google.com/storage/docs/creating-buckets) to create the GCS bucket.

1. Follow [these instructions](https://cloud.google.com/healthcare-api/docs/how-tos/dicom) to create a Cloud DICOM Store.

1. Use the [Google Cloud IAM panel](https://console.cloud.google.com/iam-admin) to grant the following permissions on the GCP resources:

* Allow the individual running the rest of the steps to manage objects in the GCS bucket by granting them the predefined role `roles/storage.objectAdmin`.

* Allow [the identity(ies) who have access to our API](#how-to-gain-access) to:
* read training labels and persist embeddings in the GCS bucket by granting them the predefined role `roles/storage.objectAdmin`.
* read DICOM images from the Cloud DICOM Store by granting them the predefined role `roles/healthcare.dicomViewer`.
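
    For reference, the same grants can be made from the command line; a sketch using gcloud, assuming a group identity and the placeholder resource names used elsewhere in this guide:

        gcloud storage buckets add-iam-policy-binding gs://YOUR_BUCKET \
            --member=group:YOUR-GROUP-NAME@YOUR-DOMAIN \
            --role=roles/storage.objectAdmin
        gcloud healthcare dicom-stores add-iam-policy-binding YOUR_DICOM_STORE_ID \
            --dataset=YOUR_DATASET_ID --location=YOUR_LOCATION \
            --member=group:YOUR-GROUP-NAME@YOUR-DOMAIN \
            --role=roles/healthcare.dicomViewer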

1. On your local machine [install the gcloud SDK](https://cloud.google.com/sdk/docs/install) and [log in](https://cloud.google.com/sdk/gcloud/reference/auth/login):

gcloud auth application-default login

1. From your local machine use the [gcloud storage commands](https://cloud.google.com/sdk/gcloud/reference/storage) to transfer training labels in PNG or JSON format and DICOM files to the GCS bucket. You may use the [`rsync` command](https://cloud.google.com/sdk/gcloud/reference/storage/rsync) instead of `cp` to handle the large volume of files that's typical for digitized pathology use cases.
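
    For example (the local paths and bucket name below are placeholders):

        gcloud storage rsync --recursive ./labels gs://YOUR_BUCKET/labels
        gcloud storage rsync --recursive ./dicom gs://YOUR_BUCKET/dicom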

1. Follow [these instructions](https://cloud.google.com/healthcare-api/docs/how-tos/dicom-import-export#gcloud) to bulk import DICOM files from the GCS bucket to your Cloud DICOM Store.
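
    For example, using the placeholder names from the previous steps:

        gcloud healthcare dicom-stores import gcs YOUR_DICOM_STORE_ID \
            --dataset=YOUR_DATASET_ID --location=YOUR_LOCATION \
            --gcs-uri=gs://YOUR_BUCKET/dicom/**.dcm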

1. Modify the demo Colab to point to your data:

# TODO(armant): please add a direct link to the relevant code cell in the Colab

* To use your training labels, replace `hai-cd3-foundations-pathology-vault-entry` with the name of your GCS bucket.

* To use your DICOM images, change the Cloud DICOM Store URLs. They take the following format: `https://healthcare.googleapis.com/v1/projects/YOUR_PROJECT_ID/locations/YOUR_LOCATION/datasets/YOUR_DATASET_ID/dicomStores/YOUR_DICOM_STORE_ID/`. Substitute `YOUR_PROJECT_ID` with the project ID you obtained in step 1, and `YOUR_LOCATION`, `YOUR_DATASET_ID`, and `YOUR_DICOM_STORE_ID` with the values from step 3.

## General notes

* Google does not keep a copy of any DICOM images processed.
* Google monitors daily query volume, aggregated on a per-user and
  per-organization basis. Access can be revoked if a user or organization
  exceeds a reasonable query volume.

## Contributing

See [`CONTRIBUTING.md`](CONTRIBUTING.md) for details.

## License

See [`LICENSE`](LICENSE) for details.

# Model Card for Path Foundation Model

This tool uses an ML model to provide the embedding results. This section
gives a brief overview of the background and limitations of that model.

## Model Details

This self-supervised model produces embeddings for image patches from
histopathology whole slide images (WSIs). Embeddings are n-dimensional vectors
of floating point values that represent a projection of the original image into
a compressed feature space. The model uses the ViT-S architecture and was
trained across magnifications with domain-specific tuning and optimization. The
resulting feature representations provided by the model offer robust input
for downstream tasks in histopathology. Additional information can be found in
the preprint [manuscript](https://arxiv.org/abs/2310.13259).

### Version
* Version: 1.0.0
* Date: 2023-12-19

### License
Research use only. Not suitable for product development.
- See [Path Foundation - Additional Terms of Service](https://docs.google.com/forms/d/1auyo2VkzlzuiAXavZy1AWUyQHAqO7T3BLK-7ofKUvug/viewform?edit_requested=true).

### Manuscript
https://arxiv.org/abs/2310.13259

### Contact
[email protected]


### Intended Use
The PathSSL model can reduce the training data, compute, and technical expertise
necessary to develop task-specific models for H&E pathology slides.
Embeddings from the model can be used for a variety of user-defined downstream
tasks including, but not limited to: cancer detection, classification, and
grading; metadata prediction (stain, tissue type, specimen type, etc.); and
quality assessment (e.g., imaging artifacts).
The embeddings can also be used to explore the feature space of histopathology
images for biomarker development associated with prognostic and predictive
tasks.
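
To make the typical workflow concrete, the sketch below trains a linear probe on embedding vectors with scikit-learn. The random arrays stand in for embeddings returned by the API and labels you supply; the 384-dimension width is an assumption based on the ViT-S architecture.

    # Sketch: train a linear classifier ("linear probe") on patch embeddings.
    # The random arrays below are placeholders for real embeddings and labels.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(1000, 384)).astype(np.float32)  # assumed ViT-S width
    labels = rng.integers(0, 2, size=1000)  # e.g. tumor vs. non-tumor patches

    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("held-out accuracy:", probe.score(X_test, y_test))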

### Training Data
Training data consisted of hematoxylin and eosin (H&E) stained WSIs from The
Cancer Genome Atlas (TCGA), accessed via https://portal.gdc.cancer.gov.
Training was performed using 60 million patches across three magnifications
(~2 µm/pixel, ~1 µm/pixel, ~0.5 µm/pixel) and 32 TCGA studies (representing
different cancer types).

### Performance & Validation
Linear probe evaluation was conducted across a diverse set of benchmark tasks
involving 17 unique tissue types and 12 unique cancer types and spanning
different optimal magnifications and task types.
See the preprint manuscript for more details, including performance on
additional slide-level tasks (e.g., tissue type classification and molecular
findings), as well as results for data titration with fine-tuning on select
tasks.

### Risks
Although Google does not store any data sent to this model, it is the data
owner's responsibility to ensure that personally identifiable information (PII)
and protected health information (PHI) are removed prior to being sent to the
model.

Mitigation strategy: Do not send data containing PII or PHI. The training
dataset is a de-identified public dataset, and pathology imaging (pixel data)
does not contain PHI.

### Limitations
* Intended for research purposes only.
* The model has been validated for only a limited number of the potential
  downstream tasks involving H&E histopathology interpretation for which it
  could be used. Task-specific validation remains an important aspect of
  model development by the end user.
* This version of the model was trained and validated only on H&E images from
  a limited set of scanners and countries. Model output may not generalize
  well to other image types, patient populations, or scanner manufacturers
  not represented in training.
* Training and validation were performed on patches corresponding to 5x, 10x,
  and 20x magnification (~2 µm/pixel, ~1 µm/pixel, and ~0.5 µm/pixel,
  respectively). Using input patches corresponding to magnifications other
  than these has not been evaluated.
* The model is only used to generate embeddings of the user-owned dataset. It
  does not generate any predictions or diagnoses on its own.
* Developers should ensure that any downstream model developed using this tool
  is validated for consistent performance across the intended-use demographics
  and image characteristics (e.g., age, sex, gender, scanner).