Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

type and storage of instance_key issues #158

Open
giovp opened this issue Feb 27, 2023 · 16 comments
Open

type and storage of instance_key issues #158

giovp opened this issue Feb 27, 2023 · 16 comments

Comments

@giovp
Copy link
Member

giovp commented Feb 27, 2023

This is to keep track of issues that might arise on the relationship between instance_key and the indexes of regions.

Motivation

The link between a table and it's region object is specified by the tuple ("region", "region_key") saved in table.uns["spatialdata_attrs"]. The types are the following

  • region: Union[str, Sequence[str]]. It can contain a single region id or a sequence of region ids
  • region_key: Optional[str]. In case of a single region_id, then it can be None

To link the single region instance, e.g. the cell, an additional key is passed:

  • instance_key: Sequence[str, float, int]

Problems

  • We don't check for the consistency of this mapping in spatial data, e.g.:
    • are all regions present in the table, may/should/must be in the Spatialdata? we are not clear about this
    • is the type of instance_key matching the type of the index/region in the respective element? e.g. for labels,, it should just be int32 or int64, whereas for shapes, it could be str or numeric.
  • we are not clear nor consistent on the way we name the regions and elements in the spatialdata_attrs

another thing to consider is that the instance_key can be just he index in a Shapes element, but it can never be an index in table.obs, because we can only have str and numeric indexes in anndata. This is also something we should be clearer about.

@LucaMarconato
Copy link
Member

LucaMarconato commented Feb 27, 2023

I wrongly thought that region_key had to be None when region is a string. I will update _concatenate_tables() accordingly.

TODO:

@LucaMarconato
Copy link
Member

LucaMarconato commented Feb 27, 2023

Yeah there was this code in Table.validate() (git blame on me 😅). I'll remove it.

elif isinstance(attr["region"], str):
    assert attr["region_key"] is None

EDIT:
removing it also from the parser (git blame on both 🙈)

@giovp
Copy link
Member Author

giovp commented Feb 28, 2023

I wrongly thought that region_key had to be None when region is a string. I will update _concatenate_tables() accordingly.

TODO:

but this is in fact true! see what I wrote above

region_key: Optional[str]. In case of a single region_id, then it can be None

so no need to fix

@LucaMarconato
Copy link
Member

LucaMarconato commented Feb 28, 2023

region_key: Optional[str]. In case of a single region_id, then it can be None

From this formulation I understand that if region is a string, then region_key is allowed to be either a string either None. I checked this from the table specs and no restriction is made on region_key. I think we should decide on one of the two behaviors. If we go for region == str implies region_key == None maybe we should update the table specs.

@giovp
Copy link
Member Author

giovp commented Feb 28, 2023

region_key: Optional[str]. In case of a single region_id, then it can be None

From this formulation I understand that if region is a string, then region_key is allowed to be either a string either None. I checked the table specs and no restriction is made on region_key. I think we should decide on one of the two behaviors. If we go for region == str implies region_key == None maybe we should update the table specs.

got it, you mean that can imply that it could be but doesn't have to be None? I understand but then I would revert back to the must probably easier to handle in genera, although concatenation more tricky? I guess the easiest solution for concatenation would be that even if region is str, region_key still existed right?

@LucaMarconato
Copy link
Member

got it, you mean that can imply that it could be but doesn't have to be None?

yes, we can replace it with must (well, MUST).

Yes for concatenation the problem is when the region_key can be None. But it's fine, the code is already written and it's just a matter of adjusting it to the precise specs. It would be cool if ad.concat was enough for concatenating, but this is not possible in any case (if two different elements have different region_key, when we can't just merge the region_key).

@LucaMarconato
Copy link
Member

we are not clear nor consistent on the way we name the regions and elements in the spatialdata_attrs

true, especially in spatialdata-io, we should find a nice way of calling them.

are all regions present in the table, may/should/must be in the Spatialdata? we are not clear about this

I prefer allowing regions that are not presents because, but no strong opinions.

is the type of instance_key matching the type of the index/region in the respective element? e.g. for labels,, it should just be int32 or int64, whereas for shapes, it could be str or numeric.

We got some bugs with this (the type was string in one object and int in the other), I think we should be clear that the type should match.

@kevinyamauchi
Copy link
Collaborator

kevinyamauchi commented Feb 28, 2023

got it, you mean that can imply that it could be but doesn't have to be None? I understand but then I would revert back to the must probably easier to handle in genera, although concatenation more tricky? I guess the easiest solution for concatenation would be that even if region is str, region_key still existed right?

I’m not sure I’m following here. @giovp, do you mean that for SpatialData we require that even if a table annotates a single region (I.e., ‘region’ is a string), we need a valid ‘region_key’ column? I’m guessing the rationale is that it makes the concatenation at easier in some cases since the resulting table would need a ‘region_key’ column.

I think we will still have to do some parsing of the ‘region_key’ column during concatenation, even if we require a region_key column for all tables. This is because the region_key value might not be the same for all columns being annotated.

Or did you mean that when ‘region’ is a string, ‘region_key’ MUST be None? If that’s the case, then I agree we should update the tables spec to match that. We can still proceed with that here though without waiting for the dust to settle on the tables spec.

@kevinyamauchi
Copy link
Collaborator

we are not clear nor consistent on the way we name the regions and elements in the spatialdata_attrs

true, especially in spatialdata-io, we should find a nice way of calling them.

+1

are all regions present in the table, may/should/must be in the Spatialdata? we are not clear about this

I prefer allowing regions that are not presents because, but no strong opinions.

I am open to allowing references to regions that are not present. However, if that’s the case, we should have a simple method to filter a SpatialData.table to only include rows that annotate regions in the object.

is the type of instance_key matching the type of the index/region in the respective element? e.g. for labels,, it should just be int32 or int64, whereas for shapes, it could be str or numeric.

We got some bugs with this (the type was string in one object and int in the other), I think we should be clear that the type should match.

I agree that we should validate the region/instance_key values when parsing the models.

@giovp
Copy link
Member Author

giovp commented Feb 28, 2023

ok, I'll try to summarise the solutions below:

  • region_key should always be present, even if region is str
  • instance_key type should be consistent across regions. I think the easiest solution is to enforce it to be int32/int64 but I understand people would want to have pandas series of str type. Had issues serializing it with zarr so need to take another look
  • in spatialdata have a method that validates that at least all regions in spatialdata are also present in the table. I understand it's fine to have regions in table that are not in spatialdata, but not the other way round, wdyt?

@LucaMarconato
Copy link
Member

in spatialdata have a method that validates that at least all regions in spatialdata are also present in the table. I understand it's fine to have regions in table that are not in spatialdata, but not the other way round, wdyt?

I would also have a method to actually filter the data to make them match (we need two parameters, like fiter_table: bool and filter_elements). This could also remove images that are not in a coordinate system together with the regions.

Regarding regions that are not annotated by the table, this case is super common. Especially since the table can only annotated limited elements, so most of the elements will not be annotated. Examples: the "anatomical" (polygon/shapes) element in merfish, or the nuclei regions in xenium (the table annotates membrane, not nuclei is annotated by the table)

@LucaMarconato
Copy link
Member

LucaMarconato commented Feb 28, 2023

Unrelated to the above but related to instance_key. I think that the column specified by instance_key should never be categorical, so we should remove this line:

if not is_categorical_dtype(data[instance_key]):

@giovp
Copy link
Member Author

giovp commented Feb 28, 2023

Regarding regions that are not annotated by the table, this case is super common. Especially since the table can only annotated limited elements, so most of the elements will not be annotated. Examples: the "anatomical" (polygon/shapes) element in merfish, or the nuclei regions in xenium (the table annotates membrane, not nuclei is annotated by the table)

that's a very good point, completely missed that, should be very flexible

@LucaMarconato
Copy link
Member

@giovp

region_key should always be present, even if region is str

I'll update this PR to reflect this. #159

All the io code and the to_zarr.py in spatialdata sandbox need also to be updated. I can do that in spatialdata-sandbox in this PR giovp/spatialdata-sandbox#17 (or also maybe just do it during the review and then merge).

@kevinyamauchi
Copy link
Collaborator

kevinyamauchi commented Feb 28, 2023

@giovp and @LucaMarconato , thanks for looking into this. I like/agree with the following from above:

  • region_key should always be present, even if region is str
  • instance_key type should be consistent across regions. I think the easiest solution is to enforce it to be int32/int64 but I understand people would want to have pandas series of str type. Had issues serializing it with zarr so need to take another look

I agree with @LucaMarconato about the methods to filter tables for elements in the containing SpatialData object and SpatialData objects for elements that are in the table.

I also agree that we need to allow elements that are not in the table (but I think @giovp you are already on board with that)

@giovp giovp mentioned this issue Mar 1, 2023
5 tasks
@LucaMarconato
Copy link
Member

So even if the consensus was to use int for instance_key, somehow we did not changed it and the table model currently allows strings. Changing this now would require a file format change since the new validator would not pass on old datasets.

I propose to change the readers in spatialdata-io (in particular xenium()) to start using int for instance_key and later on deprecate the use of str, change the file format versions and provide a tested migration tool.

CC @melonora

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants