Skip to content

Latest commit

 

History

History
356 lines (303 loc) · 12.9 KB

TCGA_Annotations.md

File metadata and controls

356 lines (303 loc) · 12.9 KB

TCGA Annotations

The goal of this notebook is to introduce you to the TCGA Annotations BigQuery table. You can find more detail about Annotations on the TCGA Wiki, but the key things to know are: an annotation can refer to any "type" of TCGA "item" (eg patient, sample, portion, slide, analyte or aliquot), and each annotation has a "classification" and a "category", both of which are drawn from controlled vocabularies. The current set of annotation classifications includes: Redaction, Notification, CenterNotification, and Observation. The authority for Redactions and Notifications is the BCR (Biospecimen Core Resource), while CenterNotifications can come from any of the data-generating centers (GSC or GCC), and Observations from any authorized TCGA personnel. Within each classification type, there are several categories. We will look at these further by querying directly on the Annotations table. Note that annotations about patients, samples, and aliquots are separate from the clinical, biospecimen, and molecular data, and most patients, samples, and aliquots do not in fact have any annotations associated with them. It can be important, however, when creating a cohort or analyzing the molecular data associated with a cohort, to check for the existence of annotations.

In order to work with BigQuery, you need to import the bigrquery library and you need to know the name(s) of the table(s) you are going to be working with:

require(bigrquery) || install.packages("bigrquery")
## [1] TRUE
annotationsTable <- "[isb-cgc:tcga_201510_alpha.Annotations]"

From now on, we will occationally refer to this table using this variable annotationsTable, but we could just as well explicitly give the table name each time. Let's start by taking a look at the table schema:

querySql <- paste("SELECT * FROM ",annotationsTable," limit 1", sep="")
result <- query_exec(querySql, project=project)
data.frame(Columns=colnames(result))
##                     Columns
## 1              annotationId
## 2      annotationCategoryId
## 3    annotationCategoryName
## 4  annotationClassification
## 5        annotationNoteText
## 6                     Study
## 7              itemTypeName
## 8               itemBarcode
## 9            AliquotBarcode
## 10       ParticipantBarcode
## 11            SampleBarcode
## 12                dateAdded
## 13              dateCreated
## 14               dateEdited

Item Types

Most of the schema fields come directly from the TCGA Annotations. First and foremost, an annotation is associated with an itemType, as described above. This can be a patient, an aliquot, etc. Let's see what the breakdown is of annotations according to item-type:

querySql <- "
SELECT itemTypeName, COUNT(*) AS n
FROM [isb-cgc:tcga_201510_alpha.Annotations]
GROUP BY itemTypeName
ORDER BY n DESC"

result <- query_exec(querySql, project=project)
head(result)
##      itemTypeName     n
## 1         Aliquot 12928
## 2 Shipped Portion  1749
## 3         Patient  1378
## 4         Analyte   789
## 5           Slide   552
## 6          Sample   114

The length of the barcode in the itemBarcode field will depend on the value in the itemTypeName field: if the itemType is 'Patient', then the barcode will be something like TCGA-E2-A15J, whereas if the itemType is "Aliquot", the barcode will be a full-length barcode, eg TCGA-E2-A15J-10A-01D-a12N-01.

Annotation Classifications and Categories

The next most important pieces of information about an annotation are the "classification" and "category". Each of these comes from a controlled vocabulary and each "classification" has a specific set of allowed "categories". One important thing to understand is that if an aliquot carries some sort of disqualifying annotation, in general all other data from other samples or aliquots associated with that same patient should still be usable. On the other hand, if a patient carries some sort of disqualifying annotation, then that information should be considered prior to using any of the samples or aliquots derived from that patient.

To illustrate this, let's look at the most frequent annotation classifications and categories when the itemType is Patient:

querySql <- "
SELECT
  annotationClassification,
  annotationCategoryName,
  COUNT(*) AS n
FROM
  [isb-cgc:tcga_201510_alpha.Annotations]
WHERE
  ( itemTypeName='Patient' )
GROUP BY
  annotationClassification,
  annotationCategoryName
HAVING ( n >= 50 )
ORDER BY
  n DESC"

result <- query_exec(querySql, project=project)
head(result)
##   annotationClassification
## 1             Notification
## 2             Notification
## 3             Notification
## 4             Notification
## 5             Notification
## 6             Notification
##                                                        annotationCategoryName
## 1                                                            Prior malignancy
## 2                                                   Alternate sample pipeline
## 3 History of unacceptable prior treatment related to a prior/other malignancy
## 4                                                      Synchronous malignancy
## 5                                                         Neoadjuvant therapy
## 6                                                        Item is noncanonical
##     n
## 1 407
## 2 200
## 3 139
## 4 110
## 5 102
## 6  81

The results of the previous query indicate that the majority of patient-level annotations are "Notifications", most frequently regarding prior malignancies. In most TCGA publications, "history of unacceptable prior treatment" and "item is noncanonical" notifications are treated as disqualifying annotations, and all data associated with those patients is not used in any analysis. Let's make a slight modification to the last query to see what types of annotation categories and classifications we see when the item type is not patient:

querySql <- "
SELECT
  annotationClassification,
  annotationCategoryName,
  itemTypeName,
  COUNT(*) AS n
FROM
  [isb-cgc:tcga_201510_alpha.Annotations]
WHERE
  ( itemTypeName!='Patient' )
GROUP BY
  annotationClassification,
  annotationCategoryName,
  itemTypeName
HAVING ( n >= 50 )
ORDER BY
  n DESC"

result <- query_exec(querySql, project=project)
head(result)
##   annotationClassification annotationCategoryName    itemTypeName     n
## 1       CenterNotification       Item flagged DNU         Aliquot 12303
## 2             Notification   Item is noncanonical Shipped Portion  1741
## 3             Notification   Item is noncanonical           Slide   541
## 4             Notification   Item is noncanonical         Analyte   464
## 5              Observation                General         Analyte   179
## 6       CenterNotification       Center QC failed         Aliquot   149

The results of the previous query indicate that the vast majority of annotations are at the aliquot level, and more specifically were submitted by one of the data-generating centers, indicating that the data derived from that aliquot is "DNU" (Do Not Use). In general, this should not affect any other aliquots derived from the same sample or any other samples derived from the same patient. We see in the output of the previous query that a Notification that an "Item is noncanonical" can be applied to different types of items (eg slides and analytes). Let's investigate this a little bit further, for example let's count up these types of annotations by study (ie tumor-type):

querySql <- "
SELECT
  Study,
  COUNT(*) AS n
FROM
  [isb-cgc:tcga_201510_alpha.Annotations]
WHERE
  ( annotationCategoryName='Item is noncanonical' )
GROUP BY
  Study
ORDER BY
  n DESC"

result <- query_exec(querySql, project=project)
head(result)
##   Study   n
## 1    OV 743
## 2   GBM 519
## 3  KIRC 455
## 4  COAD 314
## 5  LUAD 238
## 6  LUSC 231

and now let's pick one of these tumor types, and delve a little bit further:

querySql <- "
SELECT
  itemTypeName,
  COUNT(*) AS n
FROM
  [isb-cgc:tcga_201510_alpha.Annotations]
WHERE
  ( annotationCategoryName='Item is noncanonical'
    AND Study='OV' )
GROUP BY
  itemTypeName
ORDER BY
  n DESC"

result <- query_exec(querySql, project=project)
head(result)
##      itemTypeName   n
## 1           Slide 409
## 2 Shipped Portion 220
## 3         Analyte 110
## 4         Patient   3
## 5          Sample   1

Barcodes

As described above, an annotation is specific to a single TCGA "item" and the fields itemTypeName and itemBarcode are the most important keys to understanding which TCGA item carries the annotation. Because we use the fields ParticipantBarcode, SampleBarcode, and AliquotBarcode throughout our other TCGA BigQuery tables, we have added them to this table as well, but they should be interpreted with some care: when an annotation is specific to an aliquot (ie itemTypeName="Aliquot"), the ParticipantBarcode, SampleBarcode, and AliquotBarcode fields will all be set, but this should not be interpreted to mean that the annotation applies to all data derived from that patient.

This will be illustrated with the following two queries which extract information pertaining to a few specific patients:

querySql <- "
SELECT
 Study,
 itemTypeName,
 itemBarcode,
 annotationCategoryName,
 annotationClassification,
 ParticipantBarcode,
 SampleBarcode,
 AliquotBarcode,
 LENGTH(itemBarcode) AS n
FROM
  [isb-cgc:tcga_201510_alpha.Annotations]
WHERE
  ( ParticipantBarcode='TCGA-61-1916' )
ORDER BY n ASC"

result <- query_exec(querySql, project=project)
head(result)
##   Study itemTypeName          itemBarcode annotationCategoryName
## 1    OV      Patient         TCGA-61-1916 Item in special subset
## 2    OV      Analyte TCGA-61-1916-01A-01T   Item is noncanonical
## 3    OV      Analyte TCGA-61-1916-01A-01W   Item is noncanonical
## 4    OV      Analyte TCGA-61-1916-02A-01W   Item is noncanonical
## 5    OV      Analyte TCGA-61-1916-11A-01W   Item is noncanonical
## 6    OV      Analyte TCGA-61-1916-01A-01G   Item is noncanonical
##   annotationClassification ParticipantBarcode    SampleBarcode
## 1             Notification       TCGA-61-1916             <NA>
## 2             Notification       TCGA-61-1916 TCGA-61-1916-01A
## 3             Notification       TCGA-61-1916 TCGA-61-1916-01A
## 4             Notification       TCGA-61-1916 TCGA-61-1916-02A
## 5             Notification       TCGA-61-1916 TCGA-61-1916-11A
## 6             Notification       TCGA-61-1916 TCGA-61-1916-01A
##   AliquotBarcode  n
## 1           <NA> 12
## 2           <NA> 20
## 3           <NA> 20
## 4           <NA> 20
## 5           <NA> 20
## 6           <NA> 20
querySql <- "
SELECT
 Study,
 itemTypeName,
 itemBarcode,
 annotationCategoryName,
 annotationClassification,
 ParticipantBarcode,
 SampleBarcode,
 AliquotBarcode,
 LENGTH(itemBarcode) AS n
FROM
  [isb-cgc:tcga_201510_alpha.Annotations]
WHERE
  ( ParticipantBarcode='TCGA-GN-A261' )
ORDER BY n ASC"

result <- query_exec(querySql, project=project)
head(result)
##   Study itemTypeName  itemBarcode        annotationCategoryName
## 1  SKCM      Patient TCGA-GN-A261 Tumor tissue origin incorrect
## 2  SKCM      Patient TCGA-GN-A261           Neoadjuvant therapy
##   annotationClassification ParticipantBarcode SampleBarcode AliquotBarcode
## 1                Redaction       TCGA-GN-A261          <NA>           <NA>
## 2             Notification       TCGA-GN-A261          <NA>           <NA>
##    n
## 1 12
## 2 12

As you can see in the results returned from the previous two queries, the SampleBarcode and the AliquotBarcode fields may or may not be filled in, depending on the itemTypeName.

querySql <- "
SELECT
 Study,
 itemTypeName,
 itemBarcode,
 annotationCategoryName,
 annotationClassification,
 annotationNoteText,
 ParticipantBarcode,
 SampleBarcode,
 AliquotBarcode,
 LENGTH(itemBarcode) AS n
FROM
  [isb-cgc:tcga_201510_alpha.Annotations]
WHERE
  ( ParticipantBarcode='TCGA-RS-A6TP' )
ORDER BY n ASC"

result <- query_exec(querySql, project=project)
head(result)
##   Study itemTypeName          itemBarcode annotationCategoryName
## 1  HNSC      Analyte TCGA-RS-A6TP-10A-01D                General
##   annotationClassification
## 1              Observation
##                                                                                                                                                                                                            annotationNoteText
## 1 DNA analyte UUID: 8304F61F-C217-4B9F-BA64-6486DA54E6C8 was involved in an extraction protocol deviation wherein an additional column purification step was used as a means of buffer exchange on the column-eluted analyte.
##   ParticipantBarcode    SampleBarcode AliquotBarcode  n
## 1       TCGA-RS-A6TP TCGA-RS-A6TP-10A           <NA> 20

In this example, there is just one annotation relevant to this particular patient, and one has to look at the annotationNoteText to find out what the potential issue may be with this particular analyte. Any aliquots derived from this blood-normal analyte might need to be used with care.