In memory specification? #4

Open · ivirshup opened this issue Sep 1, 2022 · 9 comments
@ivirshup (Contributor) commented Sep 1, 2022

Do we want to consider an in-memory specification for sparse matrices?

In theory, Arrow has defined these, but I think they are unsupported in practice. It may be useful to keep an eye towards a definition that could be integrated with Arrow.

@eriknw (Member) commented Sep 1, 2022

> It may be useful to keep an eye towards a definition that could be integrated with Arrow.

Yes, let's keep an eye towards Arrow. What should we do, and what belongs in Arrow? Sparse tensor support in Arrow is still experimental. It probably makes sense to reach out at some point.

Arrow currently has COO, CSR, CSC, and CSF. Notably, COO uses a single array instead of one array per dimension, and it has an "is_canonical" flag that indicates whether the COO indices are sorted lexicographically and contain no duplicates.
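For illustration, here is a minimal sketch of that layout in plain C++ (made-up data, not Arrow's actual API):

#include <cstdint>
#include <vector>

// Hypothetical example: a 3x4 matrix with nonzeros at (0,1), (0,3), (2,2).
// Arrow-style COO keeps all coordinates in one row-major (nnz x ndim)
// array rather than one index array per dimension.
std::vector<int64_t> coords = {
    0, 1,  // nonzero at row 0, col 1
    0, 3,  // nonzero at row 0, col 3
    2, 2,  // nonzero at row 2, col 2
};
std::vector<double> values = {1.0, 2.0, 3.0};

// The per-dimension layout (e.g. scipy.sparse.coo_matrix) would instead be
//   rows = {0, 0, 2};  cols = {1, 3, 2};
// Since coords is sorted lexicographically by (row, col) and contains no
// duplicate coordinates, is_canonical would be true for this tensor.
bool is_canonical = true;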

@ivirshup do you have a sense for how much more work it would be to define an in-memory specification? Should we limit our scope to in-memory data structures? For example, here is what Arrow has:

For the structure:

SparseIndex:  // Base class
  const SparseTensorFormat::type format_id_;

COO:
  std::shared_ptr<Tensor> coords_;
  bool is_canonical_;

CSR/CSC:
  std::shared_ptr<Tensor> indptr_;
  std::shared_ptr<Tensor> indices_;

CSF:
  std::vector<std::shared_ptr<Tensor>> indptr_;
  std::vector<std::shared_ptr<Tensor>> indices_;
  std::vector<int64_t> axis_order_;

For the tensor (has values):

SparseTensor:
  std::shared_ptr<DataType> type_;
  std::shared_ptr<Buffer> data_;
  std::vector<int64_t> shape_;
  std::shared_ptr<SparseIndex> sparse_index_;
  // These names are optional
  std::vector<std::string> dim_names_;

so it's pretty straightforward. The constructor signatures and other methods are more interesting.

@ivirshup (Contributor, author) commented Sep 1, 2022

> do you have a sense for how much more work it would be to define an in-memory specification?

I don't. FlatBuffers specifications don't look terribly complicated, but I don't have experience with them. For reference, here are the current Arrow SparseTensor FlatBuffers files: https://github.com/apache/arrow/blob/master/format/SparseTensor.fbs

I would note that the sparse tensor definitions here rely on 2-D arrays, which also don't have great support. There also isn't a concept of chunked tensors here.

@ivirshup (Contributor, author) commented Sep 6, 2022

As another data point here, the TileDB single-cell project is looking at Arrow sparse tensors for their interchange interface: single-cell-data/SOMA#32 (comment)

@ivirshup (Contributor, author)

As another point of reference, the Python array-api is using DLPack for in-memory interchange, including interchange between devices.

I would like the array-api to also account for sparse arrays, for which densifying is a terrible interchange mechanism. Ideally I'd like the format we're defining here to be used for in-memory array interchange.

@hameerabbasi

Continuing the discussion here from data-apis/array-api#840, where I proposed exactly this.

@BenBrock (Contributor)

It seems to me that not a lot would be required to support in-memory sparse tensors. Binsparse describes how a sparse matrix is split up across one or more component binary arrays, so as long as you have some cross-platform array storage (e.g., Arrow or DLPack), I think it would work.
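To make that concrete, here is a rough sketch (hypothetical names, not part of the binsparse spec) of what an in-memory CSR object along those lines could look like, with the component arrays borrowed zero-copy from whatever library produced them:

#include <cstdint>
#include <string>

// Hypothetical sketch of an in-memory binsparse-style object: a small
// metadata descriptor plus borrowed pointers to the component binary
// arrays. The arrays themselves would travel through an existing
// cross-platform mechanism (Arrow buffers, DLPack tensors, ...).
struct BinsparseCSR {
    // Descriptor metadata (the JSON descriptor, in the file format).
    std::string format = "CSR";
    int64_t nrows = 0;
    int64_t ncols = 0;
    int64_t nnz = 0;

    // Component arrays, named as in the binsparse spec; not owned here.
    const uint64_t* pointers_to_1 = nullptr;  // row pointers, length nrows + 1
    const uint64_t* indices_1 = nullptr;      // column indices, length nnz
    const double* values = nullptr;           // stored values, length nnz
};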

There might be some related issues that are salient for in-memory formats: e.g., is JSON fast enough for storing metadata? I really don't know here; it's possible that parsing the JSON descriptor is not a bottleneck at all. But in-memory interchange is meant to be zero-copy, and thus faster than reading a file, so parsing might affect performance. Perhaps something like BSON would offer faster parsing of the descriptor?
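One quick way to sanity-check that would be to time parsing a descriptor-sized JSON string. A sketch assuming the nlohmann/json library is available; the descriptor fields below follow the binsparse draft, but treat them as illustrative:

#include <chrono>
#include <iostream>
#include <string>
#include <nlohmann/json.hpp>

int main() {
    // A descriptor is tiny, so the question is whether parsing it per
    // interchange call is measurable next to an otherwise zero-copy handoff.
    const std::string descriptor = R"({
        "binsparse": {
            "version": "0.1",
            "format": "CSR",
            "shape": [100000, 100000],
            "number_of_stored_values": 1234567,
            "data_types": {
                "pointers_to_1": "uint64",
                "indices_1": "uint64",
                "values": "float64"
            }
        }
    })";

    // Crude single-shot timing; a real benchmark would loop and average.
    auto start = std::chrono::steady_clock::now();
    auto parsed = nlohmann::json::parse(descriptor);
    auto stop = std::chrono::steady_clock::now();

    std::cout << "format: " << parsed["binsparse"]["format"] << "\n"
              << "parse time: "
              << std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()
              << " us\n";
}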

There is also potentially a wider variety of in-memory formats you might want to support (CSR4, BSR, etc.), but new formats are easy to add. 😃

@willow-ahrens (Collaborator)

Ben, could you add @hameerabbasi to the meetings?

@BenBrock (Contributor)

Sure thing. @hameerabbasi, could you send your email to [email protected]?

@rgommers

A few thoughts on the metadata exchange:

  • JSON cannot work; it's too slow and complex.
  • BSON is probably still overly complex (it requires a parser, and relying on strings is inherently not the best idea for stability/complexity).
    • Having to parse BSON also won't be ideal for performance; low overhead seems to matter enough for arrays that the overhead of in-memory array interchange for some set of users should be O(1 µs), like DLPack, rather than O(100 µs), like __cuda_array_interface__.
  • For a C interface with a stable ABI, assuming that's desirable, use enums for things like format specifiers instead of strings (see the sketch after this list).
    • Copying what dlpack.h does as much as possible, including versioning mechanics and lifetime management, is probably the safest thing to do.
  • For a Python interface (easier to get going, if high-overhead long term), translating the on-disk JSON directly to a Python dict should do the job.
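To make the C-interface bullet concrete, here is a hypothetical header sketch. None of these names exist in binsparse or DLPack; the versioning and deleter pattern is borrowed from dlpack.h:

#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

// Hypothetical sketch only. Enums rather than strings for format and type
// specifiers, plus DLPack-style versioning and lifetime management.
typedef enum { BSP_FORMAT_COO = 0, BSP_FORMAT_CSR = 1, BSP_FORMAT_CSC = 2 } BspFormat;
typedef enum { BSP_DTYPE_INT64 = 0, BSP_DTYPE_UINT64 = 1, BSP_DTYPE_FLOAT64 = 2 } BspDType;

typedef struct {
    uint32_t major;  // bumped on ABI-breaking changes, as in DLPack
    uint32_t minor;
} BspVersion;

typedef struct BspManagedSparseTensor {
    BspVersion version;
    BspFormat format;
    BspDType index_dtype;
    BspDType value_dtype;
    int32_t ndim;
    const int64_t* shape;   // length ndim
    int64_t nnz;
    void* const* arrays;    // component arrays, e.g. {pointers_to_1, indices_1, values}
    int32_t num_arrays;
    void* manager_ctx;      // opaque handle owned by the producer
    // Called once by the consumer when it is done with the tensor, so the
    // producer can release the underlying buffers (as in DLManagedTensor).
    void (*deleter)(struct BspManagedSparseTensor* self);
} BspManagedSparseTensor;

#ifdef __cplusplus
}
#endif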
