Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ref paths #342

Closed
wants to merge 17 commits into from
Closed

Ref paths #342

wants to merge 17 commits into from

Conversation

flying-sheep
Copy link
Member

@flying-sheep flying-sheep commented Mar 19, 2020

Continuation of scverse/scanpy-tutorials#15 in the proper repo.

This is already pretty usable, I’ll extend and refine it. API:

  • RefPath(attr: str, *path: str|int) creates and validates a path
  • RefPath.parse(str|(str|int)s|RefPath) does too, allowing shortcuts like "o/foo"("obs", "foo")
  • RefPath.get_vector(adata, alias_col) fetches a 1D np.ndarray
  • AnnData.get_vector(*path: str|RefPathLike, dim: "obs"|"var", layer: str, …) allows fallback. E.g. ad.get_vector('foo', dim='obs') finds foo in X or obs while ad.get_vector('X_pca', 0) finds it in obsm. Or so.
  • AnnData.get_df(paths: (str|RefPathLike)s, dim: "obs"|"var", *, layer: str, …) multiplexed AnnData.get_vector. Returns a data frame with unique column names. Colname builder resolves collisions: "obsm/X_pca/0" becomes "X_pca1" while "obsm/protein/CD11b_TotalSeqB" becomes "CD11b_TotalSeqB"

Docs:

Checklist:

  • RefPath.get_vector supports paths as shorthand "X/Actb" or long ("obs", "A/B")

  • add RefPath.dim which is e.g. useful in plotting to check if all RefPaths will return matching stuff

  • handle numeric key situation.

    • specifically for varp/obsp
    • generically for varm/obsm: But how? always converting strings to ints when possible is bad since DataFrame ambiguities, but "obsm/X_pca/0" is useful if X_pca is an array.
  • get_vector and get_df methods on AnnData that allow leaving things out there when not ambiguous

    • unambiguous subpaths or keys should resolve (e.g. ("X_pca", 0)("obsm", "X_pca", 0)), use the usual layer, … params as fallback
    • implement get_df
      • allow multipaths: [("obsm", "X_pca", [0, 1])] would expand to [("obsm", "X_pca", 0), ("obsm", "X_pca", 1)]
      • allow globs: How? just allow "X/var/*" or so? How to specify in tuples?
        • options: “Could we use either fspath or regex matching?”
      • maybe allow specifying colnames instead of deriving from paths (maybe by passing Mapping[str, RefPathLike]?)
    • implement get_vector
  • add more tests:

    • get_df
    • parsing
    • multipath splitting
    • alias_col
    • AnnData method

@ivirshup
Copy link
Member

ivirshup commented Mar 20, 2020

This looks really awesome!

A few comments and questions

allow globs: How? just allow "X/var/*" or so? How to specify in tuples?

Could we use either fspath or regex matching?

Also, what does "X/var" mean?


Couple questions about the current scope:

What about getting things from var aligned arrays? How about getting rows from obsm arrays?

k: str,
coldim: Literal["obs", "var"],
idxdim: Literal["obs", "var"],
layer: Optional[str] = None,
Copy link
Member

@falexwolf falexwolf Mar 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding an axis=None param that defaults to 0 for obsp and to ? for obsm?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To resolve the ambiguity of indexing into obsp (defaulting to rows), and potentially allows indexing into rows of obsm (where it defaults to columns).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To what for obsm?

@falexwolf
Copy link
Member

Looks great, Phil! 😄


from anndata import AnnData
from anndata._core.ref_path import RefPath, split_paths

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could there also be tests for ambiguities throwing errors here?

Copy link
Member Author

@flying-sheep flying-sheep Mar 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean ambiguities in AnnData.get_vector?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@flying-sheep
Copy link
Member Author

Also, what does "X/var" mean?

It’s because they’re unambiguous, so the axis has to be specified:

("X", "var", "Foo")means “get the n_obs length vector at np.where(var_names == 'Foo')[0]”.

What about getting things from var aligned arrays? How about getting rows from obsm arrays?

That doesn’t fit with the dim attribute, but if you can come up with a good syntax and a solution for that we can do it!

@flying-sheep
Copy link
Member Author

Can you help me come up with shortcut paths that AnnData.resolve_path should support?

Do you agree with the following? Would you add more?

@pytest.mark.parametrize("short_path,resolved", [
    # single strings should be found in {obs,var}{.columns,_names}
    # but not in {obs,var}m DataFrame columns (too deep and obscure)
    ("Cell1", ("layers", "X", "obs", "Cell1")),
    ("group", ("obs", "group")),
    # keys with subpaths should be found in layers, {obs,var}{p,m}
    ("X_pca/1", ("obsm", "X_pca", 1)),
    ("unspliced/GeneY", ("layers", "unspliced", "var", "GeneY")),
    # {obs,var}p should default to axis 0
    (("neighbors_distances", "Cell2"), ("obsp", "neighbors_distances", "Cell2", 0)),
    ...?
])
def test_resolve(adata, short_path, resolved):
    assert adata.resolve_path(short_path) == RefPath(*resolved)

@ivirshup
Copy link
Member

I don't think I like the axis being specified the same way a nested element would be. I think it's mixing metaphors, and namespaces. I would prefer an axis or dim key word argument for this.

It's also something that won't need to be specified by the user in the contexts we've been discussing, since it's implied. For example, in obs_df, all elements should be aligned to obs_names. For sc.pl.embedding, similar.

That doesn’t fit with the dim attribute, but if you can come up with a good syntax and a solution for that we can do it!

Alright. I don't think this is too important for now.

@flying-sheep
Copy link
Member Author

The dim keyword exists. This is for cases where it’s None

@ivirshup
Copy link
Member

ivirshup commented Mar 20, 2020

The dim keyword exists. This is for cases where it’s None

I don't think it should be the second argument like this. There could maybe be a different delimiter for it, but otherwise I think it's an argument to the RefDim constructor.

@ivirshup
Copy link
Member

Recap of our call:

  • I don't like specifying the dimension as a sub-element. I think it would be good to be able to specify the dimension in a string representation, but I don't think it's critical right now. Can we make it so cases like "X/var/elem" don't work?
  • The internals you built look interesting, and I'd like a chance to play around with them to find edge cases and see how they can be extended. So we have a chance to do this, and can have a release soon, could we make them private for now? I'd like to make the surface of this minimal until we've had time to play with it more. Also the really important thing this does is allow stuff like sc.pl.embedding(adata, colors="obsm/X_pca/0"), which doesn't require adata.get_vector to be public right now.

@flying-sheep
Copy link
Member Author

I added an “Internal API” page because all those docs don’t write themselves.

anndata/_core/ref_path.py Outdated Show resolved Hide resolved
@flying-sheep
Copy link
Member Author

flying-sheep commented Mar 20, 2020

resolve_path progress:

  • finds single strings in {obs,var}{.columns,_names}
    but not in {obs,var}m DataFrame columns (too deep and obscure)
    • "Cell1"RefPath("layers", "X", "obs", "Cell1")
    • "group"RefPath("obs", "group")
  • finds keys with subpaths in layers, {obs,var}m
    • "X_pca/1"RefPath("obsm", "X_pca", 1)
    • "unspliced/GeneY"RefPath("layers", "unspliced", "var", "GeneY")
  • find layers in {obs,var}p, default to axis 0
    • ("neighbors_distances", "Cell2")RefPath("obsp", "neighbors_distances", "Cell2", 0)
  • Error on ambiguity

@flying-sheep
Copy link
Member Author

docs TODO:

  • remove adata from docstring
  • examples

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants