Skip to content

Commit

Permalink
[r] Vignette for new-shape feature (#3302)
Browse files Browse the repository at this point in the history
* New-shape vignette for R

* inst/extdata/soma-exp-pbmc-small-pre-1.15.tar.gz

* Wording updates

* fix prefix-matching for dir-list in `load_dataset`

* code-review feedback

* DESCRIPTION and NEWS.md [skip ci]
  • Loading branch information
johnkerl authored Dec 4, 2024
1 parent ee8e967 commit b4cf341
Show file tree
Hide file tree
Showing 7 changed files with 283 additions and 2 deletions.
4 changes: 3 additions & 1 deletion apis/python/notebooks/tutorial_soma_shape.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,9 @@
"\n",
"In this notebook, we'll go through how you use shapes for the dataframes and arrays within your SOMA experiments, when and how you can resize, and options for experiments created before TileDB-SOMA 1.15.\n",
"\n",
"The dataset used is from Peripheral Blood Mononuclear Cells (PBMC), which is freely available from 10X Genomics. "
"The dataset used is from Peripheral Blood Mononuclear Cells (PBMC), which is freely available from 10X Genomics.\n",
"\n",
"(Please also see the [Academy tutorial](https://cloud.tiledb.com/academy/structure/life-sciences/single-cell/tutorials/shapes/).)"
]
},
{
Expand Down
2 changes: 1 addition & 1 deletion apis/r/DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ Description: Interface for working with 'TileDB'-based Stack of Matrices,
like those commonly used for single cell data analysis. It is documented at
<https://github.com/single-cell-data>; a formal specification available is at
<https://github.com/single-cell-data/SOMA/blob/main/abstract_specification.md>.
Version: 1.15.99.17
Version: 1.15.99.18
Authors@R: c(
person(given = "Aaron", family = "Wolen",
role = c("cre", "aut"), email = "[email protected]",
Expand Down
1 change: 1 addition & 0 deletions apis/r/NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@
* More fixes for unit-test cases with dense + core 2.27 [#3280](https://github.com/single-cell-data/TileDB-SOMA/pull/3280)
* Add support for writing Seurat v5 ragged arrays
* Update docstrings for `domain` argument to `create` [#3396](https://github.com/single-cell-data/TileDB-SOMA/pull/3396)
* Vignette for new-shape feature [#3302](https://github.com/single-cell-data/TileDB-SOMA/pull/3302)

# tiledbsoma 1.14.5

Expand Down
1 change: 1 addition & 0 deletions apis/r/R/datasets.R
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,7 @@ extract_dataset <- function(name, dir = tempdir()) {

# Extract tar.gz file to dir
tarfile <- dir(data_dir, pattern = name, full.names = TRUE)
tarfile <- file.path(data_dir, sprintf("%s.tar.gz", name))
stopifnot("The specified dataset does not exist" = file.exists(tarfile))

dataset_uri <- file.path(dir, name)
Expand Down
2 changes: 2 additions & 0 deletions apis/r/_pkgdown.yml
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,8 @@ navbar:
href: articles/soma-objects.html
- text: Reading From SOMA Objects
href: articles/soma-reading.html
- text: SOMA shapes
href: articles/soma-shapes.html
- text: Querying a SOMA Experiment
href: articles/soma-experiment-queries.html
- text: -------
Expand Down
Binary file not shown.
275 changes: 275 additions & 0 deletions apis/r/vignettes/soma-shapes.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,275 @@
---
title: "SOMA shapes"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{SOMA shapes}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

## Overview

As of TileDB-SOMA 1.15 we're proud to support a more intuitive and extensible notion of shape.

Please also see the [Academy tutorial](https://cloud.tiledb.com/academy/structure/life-sciences/single-cell/tutorials/shapes/).

## Example data

Let's load the bundled `SOMAExperiment` containing a subsetted version of the 10X genomics [PBMC dataset](https://satijalab.github.io/seurat-object/reference/pbmc_small.html) provided by SeuratObject. This will return a `SOMAExperiment` object.

```{r}
library(tiledbsoma)
exp <- load_dataset("soma-exp-pbmc-small")
exp
```

The `obs` dataframe has a `domain`, which is a soft limit on what values can be written to it. (You'll get an error if you try to read or write `soma_joinid` values outside this range, which is an important data-integrity reassurance.)

The `domain` we see here matches with the data populated inside of it. (This will usually be the case. It might not, if you've created the dataframe but not written any data to it yet --- at that point it's empty but it still has a shape.)

If you have more data --- more cells --- to add to the experiment later, you will be able resize the obs, up to the maxdomain which is a hard limit.

```{r}
exp$obs$domain()
```

```{r}
exp$obs$maxdomain()
```

```{r}
head(as.data.frame(exp$obs$read()$concat()))
```

The `var` dataframe's `domain` is similar:

```{r}
var <- exp$ms$get("RNA")$var
```

```{r}
var$domain()
```

```{r}
var$maxdomain()
```

Likewise, the N-dimensional arrays within the experiment have their shapes as well.

There's an important difference: while the dataframe domain gives you the inclusive lower and upper bounds for `soma_joinid` writes, the shape for the N-dimensional arrays is the upper bound plus 1.

Since there are 80 cells here and 230 genes here, `X`'s shape reflects that.

```{r}
obs <- exp$obs
var <- exp$ms$get("RNA")$var
X <- exp$ms$get("RNA")$X$get("data")
```

```{r}
obs$domain()
```

```{r}
var$domain()
```

```{r}
X$shape()
```

```{r}
X$maxshape()
```

The other N-dimensional arrays are similar:

```{r}
obsm <- exp$ms$get("RNA")$obsm
obsm$names()
```

```{r}
obsp <- exp$ms$get("RNA")$obsp
obsp$names()
```

```{r}
list(
obsm$get("X_pca")$shape(),
obsm$get("X_pca")$maxshape()
)
```

```{r}
list(
obsp$get("RNA_snn")$shape(),
obsp$get("RNA_snn")$maxshape()
)
```

In particular, the `X` array in this experiment --- and in most experiments --- is sparse. That means there needn't be a number in every row or cell of the matrix. Nonetheless, the shape serves as a soft limit for reads and writes: you'll get an exception trying to read or write outside of these.

As a general rule you'll see the following:

* An `X` array's shape is nobs x nvar
* An `obsm` array's shape is `nobs` x some number, maybe 20
* An `obsp` array's shape is `nobs` x `nobs`
* A `varm` array's shape is `nvar` x some number, maybe 20
* A `varp` array's shape is `nvar` x `nvar`

## Advanced usage: dataframes with non-standard index columns

In the SOMA data model, the `SOMASparseNDArray` and `SOMADenseNDArray` objects always have int64 dimensions named `soma_dim_0`, `soma_dim_1`, and up, and they have a numeric `soma_data` attribute for the contents of the array. Furthermore, this is always the case.

```{r}
exp$ms$get("RNA")$X$get("data")$schema()
```

For dataframes, though, while there must be a `soma_joinid` column of type int64, you can have one or more other index columns in addtion --- or, `soma_joinid` can be a non-index column.

```{r}
exp$obs$schema()
```

But really, dataframes are capable of more than that, via the index-column names you specify at creation time.

Let's create a couple dataframes, with the same data, but different choices of index-column names.

```{r}
sdfuri1 <- withr::local_tempdir("sdf1")
sdfuri2 <- withr::local_tempdir("sdf2")
```

```{r}
asch <- arrow::schema(
arrow::field("soma_joinid", arrow::int64(), nullable = FALSE),
arrow::field("mystring", arrow::large_utf8(), nullable = FALSE),
arrow::field("myint", arrow::int32(), nullable = FALSE),
arrow::field("myfloat", arrow::float32(), nullable = FALSE)
)
soma_joinid = c(0, 1)
mystring = c("hello", "world")
myint = c(33, 44)
myfloat = c(4.5, 5.5)
tbl <- arrow::arrow_table(
soma_joinid = c(soma_joinid),
mystring = c(mystring),
myint = c(myint),
myfloat = c(myfloat)
)
```

```{r}
sdf1 <- SOMADataFrameCreate(
sdfuri1,
asch,
index_column_names = c("soma_joinid", "mystring"),
domain = list(soma_joinid = c(0, 9), mystring = NULL)
)
sdf1$write(tbl)
sdf1$close()
```

Now let's look at the `domain` and `maxdomain` for these dataframes.

```{r}
sdf1 <- SOMADataFrameOpen(sdfuri1)
sdf1$index_column_names()
```

Here we see the `soma_joinid` slot of the dataframe's domain is as requested.

Another point is that domain cannot be specified for string-type index columns. You can set them at create one of two ways:

```{r, eval=FALSE}
domain = list(soma_joinid = (0, 9), mystring = NULL)
```

or

```{r, eval=FALSE}
domain = list(soma_joinid = (0, 9), mystring = c('', ''))
```

and in either case the domain slot for a string-typed index column will read back as `('', '')`.

```{r}
sdf1$domain()
```

```{r}
sdf1$maxdomain()
```

Now let's look at our other dataframe. Here `soma_joinid` is not an index column at all. This is fine, as long as within the data you write to it, the index-column values uniquely identify each row.

```{r}
sdf2 <- SOMADataFrameCreate(
sdfuri2,
asch,
index_column_names = c("myfloat", "myint"),
domain = list(myfloat = c(0, 9999), myint = c(-1000, 1000))
)
sdf2$write(tbl)
sdf2$close()
```

```{r}
sdf2 <- SOMADataFrameOpen(sdfuri1)
sdf2$index_column_names()
```

The domain reads back as written:

```{r}
sdf2$domain()
```

```{r}
sdf2$maxdomain()
```

## Advanced usage: using resize at the dataframe/array level

In the TileDB-SOMA Python API, there is a method for resizing all the dataframes and arrays within an experiment. At present we do not yet offer a corresponding method in the TileDB-SOMA R API, for the simple reason that there is low demand for it. Nonetheless, for completeness, we offer here guidance on how to resizes dataframes and arrays within a TileDB-SOMA experiment.

For N-dimensional arrays that have been upgraded, or that were created using TileDB-SOMA 1.15 or higher, simply do the following:

* If the array's `$tiledbsoma_has_upgraded_shape()` reports `FALSE`, invoke the `$tiledbsoma_upgrade_shape()` method.
* Otherwise invoke the `$.resize()` method.

Let's do a fresh unpack of a pre-1.15 experiment:

```{r}
exp <- load_dataset("soma-exp-pbmc-small-pre-1.15")
exp
```

Here we see that the X array has not been upgraded, and that its shape reports the same as maxshape:

```{r}
X <- exp$ms$get("RNA")$X$get("data")
X$tiledbsoma_has_upgraded_shape()
```

```{r}
X$shape()
```

```{r}
X$maxshape()
```

Given that pre-1.15 TileDB-SOMA-R arrays were created with a maxshape leaving no room for growth, these arrays cannot have their shape resized any further. From 1.15 onward, of course, as we've see above, arrays are created with room for growth and you can resize them upward.

0 comments on commit b4cf341

Please sign in to comment.