In memory specification? #4

Open · ivirshup opened this issue Sep 1, 2022 · 9 comments
@ivirshup (Contributor) commented Sep 1, 2022

Do we want to consider an in-memory specification for sparse matrices?

In theory, Arrow has defined these, but I think they are unsupported in practice. It may be useful to keep an eye towards a definition that could be integrated with Arrow.

@eriknw (Member) commented Sep 1, 2022

> It may be useful to keep an eye towards a definition that could be integrated with Arrow.

Yes, let's keep an eye towards Arrow. What should we do, and what belongs in Arrow? Sparse tensor support in Arrow is still experimental. It probably makes sense to reach out at some point.

Arrow currently has COO, CSR, CSC, and CSF. Notably, COO uses a single array instead of one array per dimension, and it has an "is_canonical" flag that indicates whether the COO indices are sorted lexicographically and contain no duplicates.
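For illustration, here is a minimal sketch of that layout in plain C++ (made-up data, not Arrow's actual API):

#include <cstdint>
#include <vector>

// Hypothetical example: a 3x4 matrix with nonzeros at (0,1), (0,3), (2,2).
// Arrow-style COO keeps all coordinates in one row-major (nnz x ndim)
// array rather than one index array per dimension.
std::vector<int64_t> coords = {
    0, 1,  // nonzero at row 0, col 1
    0, 3,  // nonzero at row 0, col 3
    2, 2,  // nonzero at row 2, col 2
};
std::vector<double> values = {1.0, 2.0, 3.0};

// The per-dimension layout (e.g. scipy.sparse.coo_matrix) would instead be
//   rows = {0, 0, 2};  cols = {1, 3, 2};
// Since coords is sorted lexicographically by (row, col) and contains no
// duplicate coordinates, is_canonical would be true for this tensor.
bool is_canonical = true;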

@ivirshup do you have a sense for how much more work it would be to define an in-memory specification? Should we limit our scope to in-memory data structures? For example, here is what Arrow has:

For the structure:

SparseIndex:  // Base class
  const SparseTensorFormat::type format_id_;

COO:
  std::shared_ptr<Tensor> coords_;
  bool is_canonical_;

CSR/CSC:
  std::shared_ptr<Tensor> indptr_;
  std::shared_ptr<Tensor> indices_;

CSF:
  std::vector<std::shared_ptr<Tensor>> indptr_;
  std::vector<std::shared_ptr<Tensor>> indices_;
  std::vector<int64_t> axis_order_;

For the tensor (has values):

SparseTensor:
  std::shared_ptr<DataType> type_;
  std::shared_ptr<Buffer> data_;
  std::vector<int64_t> shape_;
  std::shared_ptr<SparseIndex> sparse_index_;
  // These names are optional
  std::vector<std::string> dim_names_;

so it's pretty straightforward. The constructor signatures and other methods are more interesting.

@ivirshup (Contributor, author) commented Sep 1, 2022

> do you have a sense for how much more work it would be to define an in-memory specification?

I don't. FlatBuffers specifications don't look terribly complicated, but I don't have experience with them. For reference, here are the current Arrow SparseTensor FlatBuffers files: https://github.com/apache/arrow/blob/master/format/SparseTensor.fbs

I would note that the sparse tensor definitions here rely on 2-D arrays, which also don't have great support. There also isn't a concept of chunked tensors here.

@ivirshup (Contributor, author) commented Sep 6, 2022

As another data point here, the TileDB single-cell project is looking at Arrow sparse tensors for their interchange interface: single-cell-data/SOMA#32 (comment)

@ivirshup (Contributor, author)

As another point of reference, the Python array-api is using DLPack for in-memory interchange, including interchange between devices.

I would like the array-api to also account for sparse arrays, for which densifying is a terrible interchange mechanism. Ideally I'd like the format we're defining here to be used for in-memory array interchange.

@hameerabbasi

Continuing the discussion here from data-apis/array-api#840, where I proposed exactly this.

@BenBrock (Contributor)

It seems to me that not a lot would be required to support in-memory sparse tensors. Binsparse describes how a sparse matrix is split up across one or more component binary arrays, so as long as you have some cross-platform array storage (e.g., Arrow or DLPack), I think it would work.
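To make that concrete, here is a rough sketch (hypothetical names, not part of the binsparse spec) of what an in-memory CSR object along those lines could look like, with the component arrays borrowed zero-copy from whatever library produced them:

#include <cstdint>
#include <string>

// Hypothetical sketch of an in-memory binsparse-style object: a small
// metadata descriptor plus borrowed pointers to the component binary
// arrays. The arrays themselves would travel through an existing
// cross-platform mechanism (Arrow buffers, DLPack tensors, ...).
struct BinsparseCSR {
    // Descriptor metadata (the JSON descriptor, in the file format).
    std::string format = "CSR";
    int64_t nrows = 0;
    int64_t ncols = 0;
    int64_t nnz = 0;

    // Component arrays, named as in the binsparse spec; not owned here.
    const uint64_t* pointers_to_1 = nullptr;  // row pointers, length nrows + 1
    const uint64_t* indices_1 = nullptr;      // column indices, length nnz
    const double* values = nullptr;           // stored values, length nnz
};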

There might be some related issues that are salient for in-memory formats: e.g., is JSON fast enough for storing metadata? I really don't know here; it's possible that parsing the JSON descriptor is not a bottleneck at all. But in-memory interchange is meant to be zero-copy, and thus faster than reading a file, so parsing might affect performance. Perhaps something like BSON would offer faster parsing of the descriptor?
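One quick way to sanity-check that would be to time parsing a descriptor-sized JSON string. A sketch assuming the nlohmann/json library is available; the descriptor fields below follow the binsparse draft, but treat them as illustrative:

#include <chrono>
#include <iostream>
#include <string>
#include <nlohmann/json.hpp>

int main() {
    // A descriptor is tiny, so the question is whether parsing it per
    // interchange call is measurable next to an otherwise zero-copy handoff.
    const std::string descriptor = R"({
        "binsparse": {
            "version": "0.1",
            "format": "CSR",
            "shape": [100000, 100000],
            "number_of_stored_values": 1234567,
            "data_types": {
                "pointers_to_1": "uint64",
                "indices_1": "uint64",
                "values": "float64"
            }
        }
    })";

    // Crude single-shot timing; a real benchmark would loop and average.
    auto start = std::chrono::steady_clock::now();
    auto parsed = nlohmann::json::parse(descriptor);
    auto stop = std::chrono::steady_clock::now();

    std::cout << "format: " << parsed["binsparse"]["format"] << "\n"
              << "parse time: "
              << std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()
              << " us\n";
}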

There is also potentially a wider variety of in-memory formats you might want to support (CSR4, BSR, etc.), but new formats are easy to add. 😃

@willow-ahrens (Collaborator)

Ben, could you add @hameerabbasi to the meetings?

@BenBrock (Contributor)

Sure thing. @hameerabbasi, could you send your email to [email protected]?

@rgommers

A few thoughts on the metadata exchange:

  • JSON cannot work; it's too slow and complex.
  • BSON is probably still overly complex (it requires a parser, and relying on strings is inherently not the best idea for stability/complexity).
    • Having to parse BSON also won't be ideal for performance; low overhead seems to matter enough for arrays that the overhead of in-memory array interchange for some set of users should be O(1 µs), like DLPack, rather than O(100 µs), like __cuda_array_interface__.
  • For a C interface with a stable ABI, assuming that's desirable, use enums for things like format specifiers instead of strings (see the sketch after this list).
    • Copying what dlpack.h does as much as possible, including versioning mechanics and lifetime management, is probably the safest thing to do.
  • For a Python interface (easier to get going, if high-overhead long term), translating the on-disk JSON directly to a Python dict should do the job.
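To make the C-interface bullet concrete, here is a hypothetical header sketch. None of these names exist in binsparse or DLPack; the versioning and deleter pattern is borrowed from dlpack.h:

#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

// Hypothetical sketch only. Enums rather than strings for format and type
// specifiers, plus DLPack-style versioning and lifetime management.
typedef enum { BSP_FORMAT_COO = 0, BSP_FORMAT_CSR = 1, BSP_FORMAT_CSC = 2 } BspFormat;
typedef enum { BSP_DTYPE_INT64 = 0, BSP_DTYPE_UINT64 = 1, BSP_DTYPE_FLOAT64 = 2 } BspDType;

typedef struct {
    uint32_t major;  // bumped on ABI-breaking changes, as in DLPack
    uint32_t minor;
} BspVersion;

typedef struct BspManagedSparseTensor {
    BspVersion version;
    BspFormat format;
    BspDType index_dtype;
    BspDType value_dtype;
    int32_t ndim;
    const int64_t* shape;   // length ndim
    int64_t nnz;
    void* const* arrays;    // component arrays, e.g. {pointers_to_1, indices_1, values}
    int32_t num_arrays;
    void* manager_ctx;      // opaque handle owned by the producer
    // Called once by the consumer when it is done with the tensor, so the
    // producer can release the underlying buffers (as in DLManagedTensor).
    void (*deleter)(struct BspManagedSparseTensor* self);
} BspManagedSparseTensor;

#ifdef __cplusplus
}
#endif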
