Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: update validation for obs['is_primary_data'] to enforce False if uns['spatial']['is_single'] is False #865

Merged
merged 42 commits into from
May 2, 2024

Conversation

nayib-jose-gloria
Copy link
Contributor

@nayib-jose-gloria nayib-jose-gloria commented Apr 30, 2024

Reason for Change

Changes

  • add _validate_spatial_is_primary_data to _check_spatial_obs
  • refactored some unit tests to use a new pytest fixture, validator with a dataset that is spatial and is_single is False (since we were reusing that set-up in a few spots)

@nayib-jose-gloria nayib-jose-gloria changed the base branch from mim/829-spatial-obs to main April 30, 2024 20:15
@nayib-jose-gloria nayib-jose-gloria changed the base branch from main to mim/829-spatial-obs April 30, 2024 20:56
@@ -1000,13 +1000,14 @@ def _validate_obsm(self):
self.errors.append(
f"Suffix for embedding key in 'adata.obsm' {key} does not match the regex pattern {regex_pattern}."
)
else:
elif key.lower() != "spatial":
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

diff is from #827 merged into main

if not re.match(regex_pattern, key):
self.errors.append(
f"Embedding key in 'adata.obsm' {key} does not match the regex pattern {regex_pattern}."
)
self.warnings.append(
f"Embedding key in 'adata.obsm' {key} does not start with X_ and thus will not be available in Explorer"
f"Embedding key in 'adata.obsm' {key} is not 'spatial' nor does it start with 'X_'. Thus, it will "
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

diff is from #827 merged into main

self.errors.append(
"'spatial' embedding is forbidden in 'adata.obsm' if " "adata.uns['spatial']['is_single'] is not set."
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

diff is from #827 merged into main

@@ -142,7 +142,7 @@
"HANCESTRO:0575",
"HsapDv:0000003",
"donor_1",
"nucleus",
"na",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

diff is from #827 merged into main

# Conflicts:
#	cellxgene_schema_cli/cellxgene_schema/validate.py
#	cellxgene_schema_cli/tests/test_schema_compliance.py
#	cellxgene_schema_cli/tests/test_validate.py
Base automatically changed from mim/829-spatial-obs to main May 1, 2024 20:28
# Conflicts:
#	cellxgene_schema_cli/cellxgene_schema/validate.py
#	cellxgene_schema_cli/tests/test_schema_compliance.py
#	cellxgene_schema_cli/tests/test_validate.py
@@ -1993,30 +2010,18 @@ def test_obsm_values_no_X_embedding__non_spatial_dataset(self, validator_with_ad
"WARNING: Validation of raw layer was not performed due to current errors, try again after fixing current errors.",
]

@pytest.mark.parametrize(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

split this test up into two for simplicity in the set-up

validator.validate_adata()
assert validator.errors == []

def test_obsm_values_spatial_embedding_present__is_single_false(self, validator_with_visium_assay):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved this test into test_validate.py as test__validate_spatial_is_single_false_ok and renamed now that we have the fixture adata_spatial_is_single_false with spatial embedding present

"""
Validate is_primary_data for spatial datasets.
"""
is_single = (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use _is_single() here once this is rebased on main?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes good call

return
if is_single is False and obs["is_primary_data"].any():
self.errors.append(
"When uns['spatial']['is_single'] is False, " "obs['is_primary_data'] must be False for all rows."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a break in the string here (is False, " "obs['is_primary_data']).

Copy link
Contributor Author

@nayib-jose-gloria nayib-jose-gloria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I decided not to explicitly add a test_schema_compliance test case for allowing is_primary_data to be True or False if uns["spatial"]["is_single"] is None, because we implicitly cover this in test_validate.py by running test__validate_with_h5ad_valid_and_labels, which uses a fixture that has is_primary_data values True and False and no spatial in uns. I could be convinced to add the test case for clarity, but leaning towards no. LMK what you think

validator.validate_adata()
assert validator.errors == []
assert validator.is_spatial is True

def test_obsm_values_no_X_embedding__slide_seq_v2_dataset(self, validator_with_slide_seq_v2_assay):
validator = validator_with_slide_seq_v2_assay
validator.adata.uns["default_embedding"] = "spatial"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this test is missing validator.validate_adata() and the corresponding assertions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sad, it must've gotten lost along the way during a merge resolution

obs = getattr_anndata(self.adata, "obs")
if obs is None or "is_primary_data" not in obs:
return
if self._is_single() is False and obs["is_primary_data"].any():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are "secondary errors" typically handled? For example, should this error be reported for a non-spatial dataset that has (incorrectly) specified uns.spatial.is_single = False?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we accumulate all errors by appending to "self.errors", something like that would be handled wherever the uns check is being done. In this case, we should handle that in check_spatial_uns. I'll double check if we already are and add it in if not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait I see what you mean--yes, we do report secondary errors like this. That way, if the problem was that they did mean for it to be spatial but accidentally used the wrong assay, we still tell them in 1 run all the other issues associated with it. Generally, we try to report every possible error in one go even if they'd all be fixed with 1 upstream correction. With some exceptions--we don't validate the X matrix, for instance, if there are issues with dependent columns before then, because its an expensive operation to run if we know there will be failures. But we do print a warning letting them know that this was the case--WARNING: "Validation of raw layer was not performed due to current errors, try again after fixing current errors."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think its fine that we exit early w/o checking spatial values in check_spatial_uns if assay is not 'EFO:0010961' (Visium Spatial Gene Expression) or 'EFO:0030062' (Slide-seqV2), since curators have not requested otherwise and the errors would end up being the same for each one (i.e. this key should not exist if its not visium X times)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks Nayib!

@MillenniumFalconMechanic
Copy link
Collaborator

Looks good, thanks Nayib! I have added a couple of comments.

validator.adata.obsm["spatial"] = validator.adata.obsm["X_umap"]
def test_obsm_values_no_X_embedding__visium_dataset(self, validator_with_visium_assay):
validator = validator_with_visium_assay
validator.adata.uns["default_embedding"] = "spatial"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familiar with the complete, required shape for adata here but I see there are a couple of differences between the previous version and validator_with_visium_assay (e.g. validator.adata.obsm["harmony"] is no longer defined, validator.adata.uns["default_embedding"] is now spatial). Are these updates expected/OK?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, its expected

for non-spatial datasets, not having an X-prefix embedding like X_umap should result in an error (see test_obsm_values_no_X_embedding__non_spatial_dataset).

However, for spatial datasets, we do not require an X-prefix embedding, just 'spatial' embedding. This test case was originally modeled after test_obsm_values_no_X_embedding__non_spatial_dataset, which used 'harmony' as its non-X-prefix embedding to trigger the error. But I realized spatial is already a non-X-prefix embedding so we didn't need to add another one to test.

Copy link
Collaborator

@MillenniumFalconMechanic MillenniumFalconMechanic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks Nayib!

@nayib-jose-gloria
Copy link
Contributor Author

@brian-mott ready for your review!

@brian-mott
Copy link
Collaborator

Looks good, thanks Nayib!

@nayib-jose-gloria nayib-jose-gloria merged commit d8e94b5 into main May 2, 2024
6 checks passed
@nayib-jose-gloria nayib-jose-gloria deleted the nayib/834-primary-data branch May 2, 2024 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants