**Issue and/or context:**
As described in #1558 and #866, adding enumeration support is desirable once TileDB Embedded 2.17 is available.

**Changes:**

This PR adds support for reading columns with enumerations (aka dictionaries, aka factor variables) directly via Arrow. Preliminary write support is also available (but writes still go through the `tiledb` R package).
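
For readers unfamiliar with the term, the following is a minimal, self-contained sketch of what an enumeration (dictionary-encoded column) amounts to: a short list of distinct labels plus one integer code per row. It is an illustration only and is not code from this PR or from libtiledbsoma; the names `EnumColumn`, `encode`, and `decode` are hypothetical. It assumes 0-based codes (as Arrow dictionary indices are) and levels stored in order of first appearance, whereas R factors use 1-based codes and sort their levels by default.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical container for one dictionary-encoded (enumerated) column.
struct EnumColumn {
    std::vector<std::string> levels;   // distinct labels
    std::vector<std::int32_t> codes;   // one 0-based index into `levels` per row
};

// Encode a column of labels. Assumption: levels are kept in order of
// first appearance (R's factor() would sort them instead).
EnumColumn encode(const std::vector<std::string>& values) {
    EnumColumn col;
    std::unordered_map<std::string, std::int32_t> index;  // label -> code
    for (const auto& v : values) {
        auto it = index.find(v);
        if (it == index.end()) {
            // First time this label is seen: it becomes a new level.
            it = index.emplace(v, static_cast<std::int32_t>(col.levels.size())).first;
            col.levels.push_back(v);
        }
        col.codes.push_back(it->second);
    }
    return col;
}

// Decode back to labels, rejecting codes that do not refer to a level.
std::vector<std::string> decode(const EnumColumn& col) {
    std::vector<std::string> out;
    out.reserve(col.codes.size());
    for (std::int32_t code : col.codes) {
        if (code < 0 || static_cast<std::size_t>(code) >= col.levels.size()) {
            throw std::out_of_range("code does not refer to a level");
        }
        out.push_back(col.levels[static_cast<std::size_t>(code)]);
    }
    return out;
}
```

Under these assumptions, encoding the labels `B, T, B, NK` yields the levels `B, T, NK` and the codes `0, 1, 0, 2`. Storing integer codes is also why the diff below maps an Arrow `dictionary` type to `INT32`, with the levels supplied separately.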

**Notes for Reviewer:**

~This PR is now work-in-progress and not ready for a merge while we await TileDB 2.17.~ The branch and PR are ready but should only be merged once the prerequisites have been merged. It likely needs #1519 (C++ side) and #1663 (CI support).

CI is turned off as the TileDB default build is still without support for enumerations.
eddelbuettel authored and ihnorton committed Sep 15, 2023
1 parent 943f48d commit 4a1cebe
Showing 24 changed files with 317 additions and 61 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/r-ci.yml
@@ -9,7 +9,8 @@ on:
branches:
- main
- 'release-*'

workflow_dispatch:

env:
COVERAGE_FLAGS: "r"
COVERAGE_TOKEN: ${{ secrets.CODECOV_TOKEN }}
1 change: 1 addition & 0 deletions .github/workflows/r-python-interop-testing.yml
@@ -12,6 +12,7 @@ on:
branches:
- main
- 'release-*'
workflow_dispatch:

jobs:
ci:
1 change: 1 addition & 0 deletions apis/r/.Rbuildignore
@@ -24,6 +24,7 @@ tiledbsoma.tar.gz
# subdirectories of soft-linked libtiledbsoma
src/libtiledbsoma/build
src/libtiledbsoma/test
src/libtiledbsoma/docs

# vscode
^\.vscode$
24 changes: 16 additions & 8 deletions apis/r/DESCRIPTION
@@ -8,17 +8,25 @@ Description: Interface for working with 'TileDB'-based Stack of Matrices,
<https://github.com/single-cell-data/SOMA/blob/main/abstract_specification.md>.
Version: 1.4.3.1
Authors@R: c(
person(given = "Aaron", family = "Wolen",
role = c("cre", "aut"), email = "[email protected]",
person(given = "Aaron",
family = "Wolen",
role = c("cre", "aut"),
email = "[email protected]",
comment = c(ORCID = "0000-0003-2542-2202")),
person(given = "Dirk", family = "Eddelbuettel",
role = "aut", email = "[email protected]",
person(given = "Dirk",
family = "Eddelbuettel",
email = "[email protected]",
role = "aut",
comment = c(ORCID = "0000-0001-6419-907X")),
person(given = "Paul", family = "Hoffman",
role = "aut", email = "[email protected]",
person(given = "Paul",
family = "Hoffman",
email = "[email protected]",
role = "aut",
comment = c(ORCID = "0000-0002-7693-8957")),
person(given = "John", family = "Kerl",
role = "aut", email = "[email protected]"),
person(given = "John",
family = "Kerl",
email = "[email protected]",
role = "aut"),
person(given = "TileDB, Inc.",
role = c("cph", "fnd"))
)
1 change: 1 addition & 0 deletions apis/r/NAMESPACE
@@ -64,6 +64,7 @@ export(extract_dataset)
export(list_datasets)
export(load_dataset)
export(matrixZeroBasedView)
export(set_log_level)
export(show_package_versions)
export(tiledbsoma_stats_disable)
export(tiledbsoma_stats_dump)
4 changes: 3 additions & 1 deletion apis/r/R/Factory.R
@@ -4,14 +4,16 @@
#' @param uri URI for the TileDB object
#' @param schema schema Arrow schema argument passed on to DataFrame$create()
#' @param index_column_names Index column names passed on to DataFrame$create()
#' @param levels Optional list of enumeration (aka factor) levels
#' @param platform_config Optional platform configuration
#' @param tiledbsoma_ctx Optional SOMATileDBContext
#' @param tiledb_timestamp Optional Datetime (POSIXct) for TileDB timestamp
#' @export
SOMADataFrameCreate <- function(uri, schema, index_column_names = c("soma_joinid"),
levels = NULL,
platform_config = NULL, tiledbsoma_ctx = NULL, tiledb_timestamp = NULL) {
sdf <- SOMADataFrame$new(uri, platform_config, tiledbsoma_ctx, tiledb_timestamp, internal_use_only = "allowed_use")
sdf$create(schema, index_column_names=index_column_names, platform_config=platform_config, internal_use_only = "allowed_use")
sdf$create(schema, index_column_names=index_column_names, levels=levels, platform_config=platform_config, internal_use_only = "allowed_use")

sdf
}
7 changes: 6 additions & 1 deletion apis/r/R/RcppExports.R
@@ -6,7 +6,12 @@ soma_array_reader_impl <- function(uri, colnames = NULL, qc = NULL, dim_points =
.Call(`_tiledbsoma_soma_array_reader`, uri, colnames, qc, dim_points, dim_ranges, batch_size, result_order, loglevel, config)
}

#' @noRd
#' Set the logging level for the R package and underlying C++ library
#'
#' @param level A character value with logging level understood by \sQuote{spdlog}
#' such as \dQuote{trace}, \dQuote{debug}, \dQuote{info}, or \dQuote{warn}.
#' @return Nothing is returned as the function is invoked for the side-effect.
#' @export
set_log_level <- function(level) {
invisible(.Call(`_tiledbsoma_set_log_level`, level))
}
1 change: 0 additions & 1 deletion apis/r/R/ReadIter.R
@@ -54,7 +54,6 @@ ReadIter <- R6::R6Class(

# Internal 'external pointer' object used for iterated reads
soma_reader_pointer = NULL,
#ctx_pointer = NULL,

# to be refined in derived classes
soma_reader_transform = function(x) {
25 changes: 14 additions & 11 deletions apis/r/R/SOMADataFrame.R
@@ -20,10 +20,11 @@ SOMADataFrame <- R6::R6Class(
#' @param index_column_names A vector of column names to use as user-defined
#' index columns. All named columns must exist in the schema, and at least
#' one index column name is required.
#' @param levels Optional list of enumeration (aka factor) levels
#' @template param-platform-config
#' @param internal_use_only Character value to signal this is a 'permitted' call,
#' as `create()` is considered internal and should not be called directly.
create = function(schema, index_column_names = c("soma_joinid"), platform_config = NULL, internal_use_only = NULL) {
create = function(schema, index_column_names = c("soma_joinid"), levels = NULL, platform_config = NULL, internal_use_only = NULL) {
if (is.null(internal_use_only) || internal_use_only != "allowed_use") {
stop(paste("Use of the create() method is for internal use only. Consider using a",
"factory method as e.g. 'SOMADataFrameCreate()'."), call. = FALSE)
@@ -94,9 +95,15 @@ SOMADataFrame <- R6::R6Class(

for (field_name in attr_column_names) {
field <- schema$GetFieldByName(field_name)
tdb_attrs[[field_name]] <- tiledb_attr_from_arrow_field(
schema$GetFieldByName(field_name),
tiledb_create_options = tiledb_create_options
field_type <- tiledb_type_from_arrow_type(field$type)

tdb_attrs[[field_name]] <- tiledb::tiledb_attr(
name = field_name,
type = field_type,
nullable = field$nullable,
ncells = if (field_type == "ASCII") NA_integer_ else 1L,
filter_list = tiledb::tiledb_filter_list(tiledb_create_options$attr_filters(field_name)),
enumeration = levels[[field_name]]
)
}

@@ -110,13 +117,10 @@ SOMADataFrame <- R6::R6Class(
tile_order = cell_tile_orders["tile_order"],
capacity = tiledb_create_options$capacity(),
allows_dups = tiledb_create_options$allows_duplicates(),
offsets_filter_list = tiledb::tiledb_filter_list(
tiledb_create_options$offsets_filters()
),
validity_filter_list = tiledb::tiledb_filter_list(
tiledb_create_options$validity_filters()
offsets_filter_list = tiledb::tiledb_filter_list(tiledb_create_options$offsets_filters()),
validity_filter_list = tiledb::tiledb_filter_list(tiledb_create_options$validity_filters()),
enumerations = if (any(!sapply(levels, is.null))) levels else NULL
)
)

# create array
tiledb::tiledb_array_create(uri = self$uri, schema = tdb_schema)
@@ -242,7 +246,6 @@ SOMADataFrame <- R6::R6Class(
#' names will be extracted and added as a new column to the `data.frame`
#' prior to performing the update. The name of this new column will be set
#' to the value specified by `row_index_name`.

update = function(values, row_index_name = NULL) {
private$check_open_for_write()
stopifnot(
25 changes: 13 additions & 12 deletions apis/r/R/TableReadIter.R
@@ -1,8 +1,8 @@
#' SOMA Read Iterator over Arrow Table
#'
#' @description
#' `TableReadIter` is a class that allows for iteration over
#' a reads on \link{SOMASparseNDArray} and \link{SOMADataFrame}.
#' `TableReadIter` is a class that allows for iteration over
#' a reads on \link{SOMASparseNDArray} and \link{SOMADataFrame}.
#' Iteration chunks are retrieved as arrow::\link[arrow]{Table}
#' @export

@@ -11,32 +11,33 @@ TableReadIter <- R6::R6Class(
inherit = ReadIter,

public = list(

#' @description Concatenate remainder of iterator.
#' @return arrow::\link[arrow]{Table}
concat = function(){

if(self$read_complete()) {
warning("Iteration complete, returning NULL")
return(NULL)
}

tbl <- self$read_next()

while (!self$read_complete()) {
tbl <- arrow::concat_tables(tbl, self$read_next())
}

tbl

}),

private = list(

## refined from base class
soma_reader_transform = function(x) {
soma_array_to_arrow_table(x)
at <- soma_array_to_arrow_table(x)
at
}

)
)
19 changes: 18 additions & 1 deletion apis/r/R/utils-arrow.R
@@ -81,7 +81,8 @@ tiledb_type_from_arrow_type <- function(x) {
# fixed_size_list = "fixed_size_list",
# map_of = "map",
# duration = "duration",
stop("Unsupported data type: ", x$name, call. = FALSE)
dictionary = "INT32", # for a dictionary the 'values' are ints, levels are character
stop("Unsupported Arrow data type: ", x$name, call. = FALSE)
)
}

@@ -293,3 +294,19 @@ check_arrow_schema_data_types <- function(from, to) {
}
return(TRUE)
}

#' Extract levels from dictionaries
#' @noRd
extract_levels <- function(arrtbl) {
stopifnot("Argument must be an Arrow Table object" = is_arrow_table(arrtbl))
nm <- names(arrtbl) # we go over the table column by column
reslst <- vector(mode = "list", length = length(nm))
names(reslst) <- nm # and fill a named list, entries default to NULL
for (n in nm) {
if (inherits(arrow::infer_type(arrtbl[[n]]), "DictionaryType")) {
# levels() extracts the enumeration levels from the factor vector we have
reslst[[n]] <- levels(arrtbl[[n]]$as_vector())
}
}
reslst
}
3 changes: 3 additions & 0 deletions apis/r/man/SOMADataFrame.Rd

Generated file; diff not rendered.

3 changes: 3 additions & 0 deletions apis/r/man/SOMADataFrameCreate.Rd

Generated file; diff not rendered.

18 changes: 18 additions & 0 deletions apis/r/man/set_log_level.Rd

Generated file; diff not rendered.

23 changes: 20 additions & 3 deletions apis/r/src/rinterface.cpp
@@ -65,6 +65,8 @@ Rcpp::List soma_array_reader(const std::string& uri,
spdl::info("[soma_array_reader] Reading from {}", uri);

std::map<std::string, std::string> platform_config = config_vector_to_map(config);
// to create a Context object:
// std::make_shared<Context>(Config(platform_config)),

std::vector<std::string> column_names = {};
if (!colnames.isNull()) { // If we have column names, select them
@@ -78,7 +80,7 @@ Rcpp::List soma_array_reader(const std::string& uri,
auto sr = tdbs::SOMAArray::open(OpenMode::read,
uri,
"unnamed", // name parameter could be added
platform_config, // to add, done in iterated reader
platform_config,
column_names,
batch_size,
tdb_result_order);
@@ -151,11 +153,21 @@ Rcpp::List soma_array_reader(const std::string& uri,
memcpy((void*) chldschemaxp, pp.second.get(), sizeof(ArrowSchema));
memcpy((void*) chldarrayxp, pp.first.get(), sizeof(ArrowArray));

spdl::info("[soma_array_reader] Incoming name {} length {}", std::string(pp.second->name), pp.first->length);
spdl::info("[soma_array_reader] Incoming name {} length {}",
std::string(pp.second->name), pp.first->length);

schemaxp->children[i] = chldschemaxp;
arrayxp->children[i] = chldarrayxp;

// if (buf->has_enumeration()) {
// auto vec = buf->get_enumeration();
// Rcpp::Rcout << names[i] << ": ";
// for (auto& s: vec) {
// Rcpp::Rcout << s << " ";
// }
// Rcpp::Rcout << std::endl;
// }

if (pp.first->length > arrayxp->length) {
spdl::debug("[soma_array_reader] Setting array length to {}", pp.first->length);
arrayxp->length = pp.first->length;
@@ -167,7 +179,12 @@ Rcpp::List soma_array_reader(const std::string& uri,
return as;
}

//' @noRd
//' Set the logging level for the R package and underlying C++ library
//'
//' @param level A character value with logging level understood by \sQuote{spdlog}
//' such as \dQuote{trace}, \dQuote{debug}, \dQuote{info}, or \dQuote{warn}.
//' @return Nothing is returned as the function is invoked for the side-effect.
//' @export
// [[Rcpp::export]]
void set_log_level(const std::string& level) {
spdl::set_level(level);
5 changes: 3 additions & 2 deletions apis/r/src/riterator.cpp
@@ -91,6 +91,7 @@ Rcpp::List sr_setup(const std::string& uri,
std::string_view name = "unnamed";
std::vector<std::string> column_names = {};


std::map<std::string, std::string> platform_config = config_vector_to_map(Rcpp::wrap(config));
tiledb::Config cfg(platform_config);
spdl::debug("[sr_setup] creating ctx object with supplied config");
@@ -121,15 +122,15 @@ Rcpp::List sr_setup(const std::string& uri,
tiledb::Domain domain = schema->domain();
std::vector<tiledb::Dimension> dims = domain.dimensions();
for (auto& dim: dims) {
spdl::debug("[soma_array_reader] Dimension {} type {} domain {} extent {}",
spdl::debug("[sr_setup] Dimension {} type {} domain {} extent {}",
dim.name(), tiledb::impl::to_str(dim.type()),
dim.domain_to_str(), dim.tile_extent_to_str());
name2dim.emplace(std::make_pair(dim.name(), std::make_shared<tiledb::Dimension>(dim)));
}

// If we have a query condition, apply it
if (!qc.isNull()) {
spdl::debug("[soma_array_reader] Applying query condition") ;
spdl::debug("[sr_setup] Applying query condition") ;
Rcpp::XPtr<tiledb::QueryCondition> qcxp(qc);
ptr->set_condition(*qcxp);
}