In memory specification? #4
Comments
Yes, let's keep an eye towards Arrow. What should we do, and what belongs in Arrow? Sparse tensor support in Arrow is still experimental. It probably makes sense to reach out at some point. Arrow currently has COO, CSR, CSC, and CSF. Notably, COO uses a single array instead of one array per dimension, and it has an "is_canonical" flag which indicates whether the COO indices are sorted lexicographically and have no duplicates.
@ivirshup do you have a sense for how much more work it would be to define an in-memory specification? Should we limit our scope to in-memory data structures?
For example, here is what Arrow has. For the structure:
SparseIndex: // Base class
const SparseTensorFormat::type format_id_;
COO:
std::shared_ptr<Tensor> coords_;
bool is_canonical_;
CSR/CSC:
std::shared_ptr<Tensor> indptr_;
std::shared_ptr<Tensor> indices_;
CSF:
std::vector<std::shared_ptr<Tensor>> indptr_;
std::vector<std::shared_ptr<Tensor>> indices_;
std::vector<int64_t> axis_order_;
For the tensor (has values):
SparseTensor:
std::shared_ptr<DataType> type_;
std::shared_ptr<Buffer> data_;
std::vector<int64_t> shape_;
std::shared_ptr<SparseIndex> sparse_index_;
// These names are optional
std::vector<std::string> dim_names_;
so it's pretty straightforward. The constructor signatures and other methods are more interesting. |
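To make the "is_canonical" flag concrete, here is a minimal sketch of the check it summarizes: indices sorted lexicographically, with no duplicates. The function name `is_canonical_coo` is hypothetical and not part of Arrow's API, and the sketch assumes the single coordinate array is a row-major (nnz x ndim) matrix flattened into one buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not an Arrow function.
// Assumption: coords is a row-major (nnz x ndim) index matrix stored flat,
// so the index tuple of entry k starts at coords[k * ndim].
bool is_canonical_coo(const std::vector<std::int64_t>& coords, std::size_t ndim) {
    assert(ndim > 0 && coords.size() % ndim == 0);
    const std::size_t nnz = coords.size() / ndim;
    for (std::size_t k = 1; k < nnz; ++k) {
        const std::int64_t* prev = &coords[(k - 1) * ndim];
        const std::int64_t* curr = &coords[k * ndim];
        // Require strictly increasing lexicographic order (prev < curr).
        // A duplicate tuple or an out-of-order tuple both fail this test.
        if (!std::lexicographical_compare(prev, prev + ndim, curr, curr + ndim)) {
            return false;
        }
    }
    return true;
}
```

For a 2-D tensor with entries at (0,1), (0,2), (1,0), the flat buffer `{0, 1, 0, 2, 1, 0}` passes the check, while `{0, 2, 0, 1}` (unsorted) and `{0, 1, 0, 1}` (duplicate) do not. Storing the result as a flag lets consumers skip this O(nnz * ndim) scan.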
I don't. I would note that the sparse tensor definitions here rely on 2D arrays, which also don't have great support. There also isn't a concept of chunked tensors here. |
As another data point here, the TileDB single-cell project is looking at Arrow sparse tensors for their interchange interface: single-cell-data/SOMA#32 (comment) |
Another point of reference: the Python array-api is using DLPack for in-memory interchange, including interchange between devices. I would like the array-api to also account for sparse arrays, for which densifying is a terrible interchange mechanism. Ideally, I'd like the format we're defining here to be used for in-memory array interchange. |
Continuing the discussion here from data-apis/array-api#840 where I proposed exactly this. |
It seems to me like not a lot would be required to support in-memory sparse tensors. Binsparse describes how a sparse matrix is split up across one or more component binary arrays, so as long as you have some cross-platform array storage (e.g. Arrow or DLPack), I think it would work. There might be some related issues that are salient for in-memory formats: e.g., is JSON fast enough for storing metadata? (I really don't know here. It's possible that parsing the JSON descriptor is not a bottleneck at all, but in-memory interchange is meant to be zero-copy and thus faster than reading a file, so descriptor parsing might affect performance. Perhaps something like BSON would offer faster parsing of the descriptor?) There is also potentially a wider variety of in-memory formats you might want to support (CSR4, BSR, etc.), but new formats are easy to add. 😃 |
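As a rough illustration of the "component arrays plus a small descriptor" idea, here is a minimal sketch of a non-owning CSR view over buffers that already live in memory. The names `CsrView`, `make_csr_view`, and `csr_at` are hypothetical and do not come from Binsparse, Arrow, or DLPack. The sketch assumes 64-bit indices and `double` values, and uses plain `std::vector` buffers as a stand-in for whatever cross-platform array storage is actually used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical non-owning CSR view; the field names are illustrative only.
// The "descriptor" part is the shape; the rest are pointers into
// component arrays owned by someone else.
struct CsrView {
    std::int64_t nrows;
    std::int64_t ncols;
    const std::int64_t* indptr;   // nrows + 1 entries
    const std::int64_t* indices;  // column index of each stored value
    const double* values;         // the stored values
};

// Wrap existing buffers without copying them (zero-copy).
// The caller must keep the buffers alive for as long as the view is used.
CsrView make_csr_view(std::int64_t nrows, std::int64_t ncols,
                      const std::vector<std::int64_t>& indptr,
                      const std::vector<std::int64_t>& indices,
                      const std::vector<double>& values) {
    assert(nrows >= 0 && ncols >= 0);
    assert(indptr.size() == static_cast<std::size_t>(nrows) + 1);
    assert(indices.size() == values.size());
    assert(static_cast<std::size_t>(indptr.back()) == values.size());
    return CsrView{nrows, ncols, indptr.data(), indices.data(), values.data()};
}

// Look up A(i, j) by scanning row i; entries that are not stored read as 0.0.
double csr_at(const CsrView& a, std::int64_t i, std::int64_t j) {
    assert(i >= 0 && i < a.nrows && j >= 0 && j < a.ncols);
    for (std::int64_t p = a.indptr[i]; p < a.indptr[i + 1]; ++p) {
        if (a.indices[p] == j) {
            return a.values[p];
        }
    }
    return 0.0;
}
```

For the 2 x 3 matrix [[1, 0, 2], [0, 3, 0]], the component arrays would be `indptr = {0, 2, 3}`, `indices = {0, 2, 1}`, and `values = {1.0, 2.0, 3.0}`. Only the small descriptor (shape, format, dtypes) would need to be serialized or parsed; the component arrays are handed over by pointer, which is why the cost of descriptor parsing becomes the open question for in-memory use.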
Ben, could you add @hameerabbasi to the meetings? |
Sure thing—@hameerabbasi, could you send your email to [email protected]? |
A few thoughts on the metadata exchange:
|
Do we want to consider an in-memory specification for sparse matrices?
In theory, Arrow has defined these, but I think they are unsupported in practice. It may be useful to keep an eye towards a definition that could be integrated with Arrow.