type and storage of `instance_key` issues #158

giovp · 2023-02-27T20:28:05Z

This is to keep track of issues that might arise on the relationship between instance_key and the indexes of regions.

Motivation

The link between a table and its region object is specified by the tuple ("region", "region_key") saved in table.uns["spatialdata_attrs"]. The types are the following:

region: Union[str, Sequence[str]]. It can contain a single region id or a sequence of region ids
region_key: Optional[str]. In case of a single region_id, then it can be None

To link the single region instance, e.g. the cell, an additional key is passed:

instance_key: Sequence[Union[str, float, int]]
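To make the layout concrete, here is a minimal sketch of the metadata described above, using plain Python dicts and lists in place of a real AnnData table. The element name `cells`, the column names `region` and `cell_id`, and the helper `rows_for_region` are hypothetical, and storing `instance_key` next to `region` and `region_key` in `spatialdata_attrs` is an assumption of this sketch.

```python
# Plain-dict stand-in for table.uns["spatialdata_attrs"] (names are illustrative).
spatialdata_attrs = {
    "region": "cells",          # a single region id, or a list of region ids
    "region_key": "region",     # obs column naming the region each row annotates
    "instance_key": "cell_id",  # obs column holding the instance id in that region
}

# Plain-list stand-in for table.obs: one row per annotated instance.
obs = [
    {"region": "cells", "cell_id": 1},
    {"region": "cells", "cell_id": 2},
    {"region": "cells", "cell_id": 5},
]


def rows_for_region(obs, attrs, region):
    """Return the instance ids of the rows that annotate `region`."""
    region_key = attrs["region_key"]
    instance_key = attrs["instance_key"]
    return [row[instance_key] for row in obs if row[region_key] == region]


print(rows_for_region(obs, spatialdata_attrs, "cells"))
```

The two keys together identify one instance: `region_key` picks the element, `instance_key` picks the instance inside it.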

Problems

We don't check for the consistency of this mapping in spatial data, e.g.:
- are all regions present in the table, may/should/must be in the Spatialdata? we are not clear about this
- is the type of instance_key matching the type of the index/region in the respective element? e.g. for labels, it should just be int32 or int64, whereas for shapes, it could be str or numeric.
- we are not clear nor consistent on the way we name the regions and elements in the spatialdata_attrs

Another thing to consider is that the instance_key can be just the index in a Shapes element, but it can never be an index in table.obs, because we can only have str and numeric indexes in anndata. This is also something we should be clearer about.
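A rough sketch of what a type-consistency check could look like, in plain Python. The function name `check_instance_ids` and the `element_kind` strings are hypothetical, and Python `int` stands in for int32/int64 here; this is not the library's validator.

```python
from numbers import Integral, Real


def check_instance_ids(table_ids, element_index, element_kind):
    """Return a list of human-readable problems (empty if consistent)."""
    problems = []
    # Assumption from the discussion: labels use integer ids only,
    # shapes may use strings or numbers.
    allowed = (Integral,) if element_kind == "labels" else (str, Real)
    for name, ids in (("table", table_ids), ("element", element_index)):
        # bool is a subclass of int, so it is rejected explicitly.
        bad = [i for i in ids if not isinstance(i, allowed) or isinstance(i, bool)]
        if bad:
            problems.append(f"{name} has ids of unsupported type: {bad!r}")
    # The same id stored as "1" in one place and 1 in the other never matches.
    table_types = {type(i) for i in table_ids}
    element_types = {type(i) for i in element_index}
    if table_types != element_types:
        problems.append("table and element ids have different types")
    return problems
```

For example, `check_instance_ids(["1", "2"], [1, 2], "labels")` reports both an unsupported type in the table and a type mismatch between table and element.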


LucaMarconato · 2023-02-27T22:40:23Z

I wrongly thought that region_key had to be None when region is a string. I will update _concatenate_tables() accordingly.

TODO:

fix _concatenate_tables(). Done here: fixed schema for table; fixed _concatenate_tables() #159

LucaMarconato · 2023-02-27T22:54:52Z

Yeah there was this code in Table.validate() (git blame on me 😅). I'll remove it.

elif isinstance(attr["region"], str):
    assert attr["region_key"] is None

EDIT:
removing it also from the parser (git blame on both 🙈)

giovp · 2023-02-28T08:28:49Z

I wrongly thought that region_key had to be None when region is a string. I will update _concatenate_tables() accordingly.

TODO:

fix _concatenate_tables(). Done here: fixed schema for table; fixed _concatenate_tables() #159

but this is in fact true! see what I wrote above

region_key: Optional[str]. In case of a single region_id, then it can be None

so no need to fix

LucaMarconato · 2023-02-28T09:08:36Z

region_key: Optional[str]. In case of a single region_id, then it can be None

From this formulation I understand that if region is a string, then region_key is allowed to be either a string or None. I checked this from the table specs and no restriction is made on region_key. I think we should decide on one of the two behaviors. If we go for region == str implies region_key == None, maybe we should update the table specs.

giovp · 2023-02-28T09:12:46Z

region_key: Optional[str]. In case of a single region_id, then it can be None

From this formulation I understand that if region is a string, then region_key is allowed to be either a string or None. I checked the table specs and no restriction is made on region_key. I think we should decide on one of the two behaviors. If we go for region == str implies region_key == None, maybe we should update the table specs.

got it, you mean that "can" implies that it could be, but doesn't have to be, None? I understand, but then I would revert back to "must", which is probably easier to handle in general, although it makes concatenation more tricky. I guess the easiest solution for concatenation would be that region_key still exists even if region is str, right?

LucaMarconato · 2023-02-28T09:21:04Z

got it, you mean that "can" implies that it could be, but doesn't have to be, None?

yes, we can replace it with must (well, MUST).

Yes, for concatenation the problem is when the region_key can be None. But it's fine, the code is already written and it's just a matter of adjusting it to the precise specs. It would be cool if ad.concat was enough for concatenating, but this is not possible in any case (if two different elements have different region_key, then we can't just merge the region_key).
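To illustrate why a plain concatenation is not enough when the region_key columns differ, here is a minimal sketch on plain dicts. It is not the actual `_concatenate_tables()` code: the function name `concatenate_obs`, the table layout and the default merged column name `region` are all hypothetical, and it assumes every input table has a non-None region_key.

```python
def concatenate_obs(tables, merged_region_key="region"):
    """Concatenate obs rows of tables whose region_key columns may differ.

    Each table is a dict with "obs" (list of row dicts) and "attrs"
    (with "region" and "region_key"). Rows are copied, and each table's
    own region column is renamed to `merged_region_key`.
    """
    merged_rows = []
    merged_regions = []
    for table in tables:
        attrs = table["attrs"]
        regions = attrs["region"]
        if isinstance(regions, str):
            regions = [regions]
        old_key = attrs["region_key"]
        for row in table["obs"]:
            # Drop the table-specific region column, re-add it under one name.
            new_row = {k: v for k, v in row.items() if k != old_key}
            new_row[merged_region_key] = row[old_key]
            merged_rows.append(new_row)
        for region in regions:
            if region not in merged_regions:
                merged_regions.append(region)
    merged_attrs = {"region": merged_regions, "region_key": merged_region_key}
    return {"obs": merged_rows, "attrs": merged_attrs}
```

The renaming step is the part a generic concatenation would not do on its own: without it, the merged table would carry two partially filled region columns.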

LucaMarconato · 2023-02-28T09:24:13Z

we are not clear nor consistent on the way we name the regions and elements in the spatialdata_attrs

true, especially in spatialdata-io, we should find a nice way of calling them.

are all regions present in the table, may/should/must be in the Spatialdata? we are not clear about this

I prefer allowing regions that are not present, but no strong opinions.

is the type of instance_key matching the type of the index/region in the respective element? e.g. for labels, it should just be int32 or int64, whereas for shapes, it could be str or numeric.

We got some bugs with this (the type was string in one object and int in the other), I think we should be clear that the type should match.

kevinyamauchi · 2023-02-28T11:12:28Z

got it, you mean that "can" implies that it could be, but doesn't have to be, None? I understand, but then I would revert back to "must", which is probably easier to handle in general, although it makes concatenation more tricky. I guess the easiest solution for concatenation would be that region_key still exists even if region is str, right?

I’m not sure I’m following here. @giovp, do you mean that for SpatialData we require that even if a table annotates a single region (i.e., ‘region’ is a string), we need a valid ‘region_key’ column? I’m guessing the rationale is that it makes concatenation easier in some cases, since the resulting table would need a ‘region_key’ column.

I think we will still have to do some parsing of the ‘region_key’ column during concatenation, even if we require a region_key column for all tables. This is because the region_key value might not be the same for all columns being annotated.

Or did you mean that when ‘region’ is a string, ‘region_key’ MUST be None? If that’s the case, then I agree we should update the tables spec to match that. We can still proceed with that here though without waiting for the dust to settle on the tables spec.

kevinyamauchi · 2023-02-28T11:17:26Z

we are not clear nor consistent on the way we name the regions and elements in the spatialdata_attrs

true, especially in spatialdata-io, we should find a nice way of calling them.

+1

are all regions present in the table, may/should/must be in the Spatialdata? we are not clear about this

I prefer allowing regions that are not present, but no strong opinions.

I am open to allowing references to regions that are not present. However, if that’s the case, we should have a simple method to filter a SpatialData.table to only include rows that annotate regions in the object.

is the type of instance_key matching the type of the index/region in the respective element? e.g. for labels, it should just be int32 or int64, whereas for shapes, it could be str or numeric.

We got some bugs with this (the type was string in one object and int in the other), I think we should be clear that the type should match.

I agree that we should validate the region/instance_key values when parsing the models.

giovp · 2023-02-28T13:27:33Z

ok, I'll try to summarise the solutions below:

- region_key should always be present, even if region is str
- instance_key type should be consistent across regions. I think the easiest solution is to enforce it to be int32/int64 but I understand people would want to have pandas series of str type. Had issues serializing it with zarr so need to take another look
- in spatialdata have a method that validates that at least all regions in spatialdata are also present in the table. I understand it's fine to have regions in table that are not in spatialdata, but not the other way round, wdyt?
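The first two points of the summary could be checked along these lines. This is only a sketch on plain dicts, not the table model's validator; `validate_table` and the row layout are hypothetical.

```python
def validate_table(obs, attrs):
    """Raise ValueError if the table breaks the first two rules above."""
    region_key = attrs.get("region_key")
    instance_key = attrs["instance_key"]

    # Rule 1: region_key is required even when region is a single string.
    if not isinstance(region_key, str):
        raise ValueError("region_key must be a column name, even for a single region")
    for row in obs:
        if region_key not in row:
            raise ValueError(f"row is missing the {region_key!r} column")

    # Rule 2: instance ids share one type across all regions.
    id_types = {type(row[instance_key]) for row in obs}
    if len(id_types) > 1:
        names = sorted(t.__name__ for t in id_types)
        raise ValueError(f"mixed instance_key types: {names}")
```

A table that stores `1` for one region and `"1"` for another fails the second check, which is the kind of mismatch that caused the bugs mentioned earlier in the thread.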

LucaMarconato · 2023-02-28T13:44:42Z

in spatialdata have a method that validates that at least all regions in spatialdata are also present in the table. I understand it's fine to have regions in table that are not in spatialdata, but not the other way round, wdyt?

I would also have a method to actually filter the data to make them match (we need two parameters, like filter_table: bool and filter_elements). This could also remove images that are not in a coordinate system together with the regions.
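A sketch of how such a method could behave, on plain Python containers. The function name `match_table_and_elements` and the defaults are assumptions; only the two flag names come from the proposal above, and the coordinate-system filtering of images is left out.

```python
def match_table_and_elements(
    obs, region_key, elements, filter_table=True, filter_elements=False
):
    """Return (obs, elements) restricted so that table and elements agree.

    obs: list of row dicts; elements: dict mapping element name -> element.
    filter_table drops rows that annotate regions missing from `elements`;
    filter_elements drops elements that no row of the table annotates.
    """
    # Regions referenced by the table, computed before any filtering.
    annotated = {row[region_key] for row in obs}
    if filter_table:
        obs = [row for row in obs if row[region_key] in elements]
    if filter_elements:
        elements = {name: el for name, el in elements.items() if name in annotated}
    return obs, elements
```

Both inputs are left untouched; the function returns new containers, so the caller decides whether to replace the originals.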

Regarding regions that are not annotated by the table, this case is super common, especially since the table can only annotate a limited number of elements, so most of the elements will not be annotated. Examples: the "anatomical" (polygon/shapes) element in merfish, or the nuclei regions in xenium (the table annotates the membrane, not the nuclei).

LucaMarconato · 2023-02-28T13:55:57Z

Unrelated to the above but related to instance_key. I think that the column specified by instance_key should never be categorical, so we should remove this line:

spatialdata/spatialdata/_core/models.py

Line 464 in 4fb3433

if not is_categorical_dtype(data[instance_key]):

giovp · 2023-02-28T14:44:20Z

Regarding regions that are not annotated by the table, this case is super common, especially since the table can only annotate a limited number of elements, so most of the elements will not be annotated. Examples: the "anatomical" (polygon/shapes) element in merfish, or the nuclei regions in xenium (the table annotates the membrane, not the nuclei).

that's a very good point, completely missed that, should be very flexible

LucaMarconato · 2023-02-28T16:07:43Z

@giovp

region_key should always be present, even if region is str

I'll update this PR to reflect this. #159

All the io code and the to_zarr.py in spatialdata-sandbox also need to be updated. I can do that in spatialdata-sandbox in this PR giovp/spatialdata-sandbox#17 (or also maybe just do it during the review and then merge).

kevinyamauchi · 2023-02-28T16:39:15Z

@giovp and @LucaMarconato , thanks for looking into this. I like/agree with the following from above:

region_key should always be present, even if region is str
instance_key type should be consistent across regions. I think the easiest solution is to enforce it to be int32/int64 but I understand people would want to have pandas series of str type. Had issues serializing it with zarr so need to take another look

I agree with @LucaMarconato about the methods to filter tables for elements in the containing SpatialData object and SpatialData objects for elements that are in the table.

I also agree that we need to allow elements that are not in the table (but I think @giovp you are already on board with that)

LucaMarconato · 2024-03-13T13:02:11Z

So even if the consensus was to use int for instance_key, somehow we did not change it and the table model currently allows strings. Changing this now would require a file format change since the new validator would not pass on old datasets.

I propose to change the readers in spatialdata-io (in particular xenium()) to start using int for instance_key and later on deprecate the use of str, change the file format versions and provide a tested migration tool.
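As a rough idea of what the core of such a migration step could look like, here is a sketch that converts string ids to ints only when the conversion is lossless. The function name `migrate_instance_ids` is hypothetical, and ids that are not purely numeric would need a separate mapping that is out of scope here.

```python
def migrate_instance_ids(ids):
    """Convert string instance ids to ints, refusing lossy conversions."""
    mapping = {}
    for value in ids:
        if isinstance(value, str) and value.isascii() and value.isdigit():
            new = int(value)
        elif isinstance(value, int) and not isinstance(value, bool):
            new = value
        else:
            raise ValueError(f"cannot migrate instance id {value!r} to int")
        mapping[value] = new
    # Distinct old ids must stay distinct, e.g. "7" and "007" both become 7.
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("distinct instance ids collapse to the same int")
    return [mapping[value] for value in ids]
```

Raising instead of guessing keeps the old dataset readable as-is whenever the ids cannot be mapped one-to-one.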

CC @melonora

LucaMarconato mentioned this issue Feb 28, 2023

fixed schema for table; fixed _concatenate_tables() #159

Closed

giovp mentioned this issue Feb 28, 2023

instance_key as categorical in PointsModel #160

Closed

giovp mentioned this issue Mar 1, 2023

update table model #164

Merged

5 tasks

LucaMarconato added I/O 💿 element: table 📑 labels Mar 13, 2024

LucaMarconato mentioned this issue Mar 13, 2024

Test spatialelement table join scverse/napari-spatialdata#208

Merged
