Fix categorical (factor) handling in native R reader #86

jackkamm · 2023-02-06T23:17:43Z

When I try to use the native R reader on an h5ad file I have, like so:

sce <- readH5AD("/path/to/my/anndata.h5ad", 
                reader='R',
                use_hdf5=TRUE,
                verbose=TRUE)

I get the following warning:

Warning messages:
1: In value[[3L]](cond) : setting 'colData' failed for
  '/path/to/my/anndata.h5ad':
  cannot coerce class "list" to a DataFrame
2: In value[[3L]](cond) : setting 'rowData' failed for
  '/path/to/my/anndata.h5ad':
  cannot coerce class "list" to a DataFrame

And the colData and rowData of the returned SCE are empty.

I tracked it down: zellkonverter was unable to convert some Categorical (factor) columns in the obs/var, because AnnData 0.8 changed the way Categoricals are encoded.

This updates the native R reader so it can read factors encoded with the newer format. After applying this patch I was able to read my h5ad file.
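
For illustration, the conversion at issue can be sketched in base R. This is a minimal sketch and not the patch itself: `decode_categorical` is a hypothetical helper, and it assumes the layout described in the AnnData 0.8 on-disk spec, where a categorical column is stored as 0-based integer `codes` plus a `categories` vector, with -1 marking a missing value.

```r
# Hypothetical helper (not zellkonverter's actual code): rebuild an R factor
# from 0-based integer codes and the vector of category labels.
decode_categorical <- function(codes, categories, ordered = FALSE) {
    codes <- as.integer(codes)
    # Assumption from the AnnData spec: a code of -1 means "missing".
    codes[codes < 0L] <- NA_integer_
    # Shift to R's 1-based indexing; NA codes index to NA labels.
    factor(categories[codes + 1L], levels = categories, ordered = ordered)
}

# Made-up example data, not taken from any real h5ad file.
cell_type <- decode_categorical(c(0, 2, -1, 1), c("B", "NK", "T"))
print(cell_type)
```

Passing `levels = categories` keeps categories that happen to be unused in the column, which a plain `factor()` call on the labels would drop.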

EDIT:

Related: #78

Thought I should also add why I'm using the R reader rather than the recommended python reader. It's because the python reader had errors reading X/layers matrices from my h5ad file due to unsorted indptr (seems to be the same issue as theislab/anndata2ri#51). I was able to fix some of the errors by calling sort_indices() and resaving the AnnData, but some matrices still wouldn't load, and the ones that did were loaded as dgCMatrix instead of DelayedArray. By contrast I found the R reader successfully read all matrices from my h5ad file as DelayedArray.

LTLA · 2023-02-06T23:59:01Z

@lazappi will have more comments, but one thing I'll note is that the change must be back-compatible with the previous H5AD formats. I'm not familiar with the differences but it sounds like __categories no longer exists in the new format, in which case you could check for that in names(fields) and switch to your approach if it doesn't exist. This avoids weird interactions between old versions of the files that happen to use the new keywords.

(A better approach would be to detect the file version up-front and pass it along to each internal function, which avoids the need for guessing inside each function. However, this would involve a more dramatic set of changes.)

Also, I have to say it, but the surrounding code uses 4-space indenting, and so should you.

jackkamm · 2023-02-07T00:10:35Z

> Also, I have to say it, but the surrounding code uses 4-space indenting, and so should you.

Thanks for catching that, fixed now.

> but one thing I'll note is that the change must be back-compatible with the previous H5AD formats

I think the changes should be back-compatible. I did not delete the handling for the old categories format, but simply added code to handle factors encoded in the new format. Theoretically I think it could even handle an h5ad that contains a mix of factors encoded in both old & new formats (not that such a scenario would happen).

I could switch to an explicit if/else instead if preferable, perhaps by checking if the __categories key exists, or maybe there's a better way to determine which format the AnnData is in?

EDIT: I went ahead and updated the code to wrap it in an if/else that checks if the __categories key exists.
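
The if/else described above might look roughly like the following. This is a sketch under stated assumptions, not the code in the PR: `read_factor_columns` and `decode` are hypothetical names, `fields` is a plain nested list standing in for what `rhdf5::h5read()` would return for obs/var, and the old layout is assumed to keep the labels in a `__categories` entry keyed by column name.

```r
# Hypothetical sketch of the format switch; names are illustrative only.
decode <- function(codes, categories) {
    codes <- as.integer(codes)
    codes[codes < 0L] <- NA_integer_          # -1 marks a missing value
    factor(categories[codes + 1L], levels = categories)
}

read_factor_columns <- function(fields) {
    if ("__categories" %in% names(fields)) {
        # Old layout: codes are ordinary columns, labels sit in `__categories`.
        cats <- fields[["__categories"]]
        fields[["__categories"]] <- NULL
        for (nm in names(cats)) {
            fields[[nm]] <- decode(fields[[nm]], cats[[nm]])
        }
    } else {
        # New layout: each categorical column is a group of codes + categories.
        for (nm in names(fields)) {
            col <- fields[[nm]]
            if (is.list(col) && all(c("codes", "categories") %in% names(col))) {
                fields[[nm]] <- decode(col$codes, col$categories)
            }
        }
    }
    fields
}

# Mock inputs in both layouts (made-up data).
old_style <- list(cell_type = c(0L, 1L, 0L), n = c(1.5, 2, 3),
                  `__categories` = list(cell_type = c("a", "b")))
new_style <- list(cell_type = list(codes = c(0L, 1L, 0L),
                                   categories = c("a", "b")),
                  n = c(1.5, 2, 3))
out_old <- read_factor_columns(old_style)
out_new <- read_factor_columns(new_style)
```

Both mock inputs should produce the same `cell_type` factor, which is the back-compatibility property under discussion.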

lazappi · 2023-02-07T08:22:21Z

Hi @jackkamm. Thanks for the contribution! The R reader is still a work in progress but I know some people have tried using it so this will be appreciated. I just have a couple of points:

- There is now a written spec for the anndata 0.8 H5AD format https://anndata.readthedocs.io/en/latest/fileformat-prose.html. If you could check that everything is consistent that would be great.
- It would be great to have at least one test for this (either a normal test or a long test). The simplest thing might be to write a file with the Python 0.8 writer and see if it can be read with the R reader (making sure it has the right kind of columns).
- Do you think this solves #78 (readH5AD(..., reader="R") fails with recent AnnData formats) as well, or is that a slightly different issue?

jackkamm · 2023-02-09T02:49:44Z

> There is now a written spec for the anndata 0.8 H5AD format https://anndata.readthedocs.io/en/latest/fileformat-prose.html. If you could check that everything is consistent that would be great.

Thanks for the reference -- looks like more work needs to be done to support nullable booleans and ints as per the spec. I will look into it.

> It would be great to have at least one test for this (either a normal test or a long test). The simplest thing might be to write a file with the Python 0.8 writer and see if it can be read with the R reader (making sure it has the right kind of columns).

Sounds good, I've added a test, still a work-in-progress since it needs to check the nullable booleans mentioned above.

> Do you think this solves #78 as well or is that a slightly different issue?

I think it's basically the same issue: I was getting the same error message, and the PR fixed it for me (as in, I was able to read in the h5ad). But the original PR only fixed a subset of the problem (factors/categories); a little more work is needed to make sure other types are also converted correctly.

lazappi · 2023-02-09T07:42:19Z

👍🏻 It sounds like you are still working on some things which is great, just ping me when you are ready for a review

jackkamm · 2023-02-17T22:10:53Z

I've added handling for nullable ints and bools, so I believe the implementation satisfies the v0.8 spec now [1], and is ready for a review.

For the test, I found that writeH5AD() wasn't quite consistent with the spec [2], therefore I manually created a separate AnnData v0.8 object in Python [3], which I saved to the git repo.

Footnotes:

[1] The spec also mentions string handling, but I didn't test it because AnnData.write always converts strings to factors: https://github.com/scverse/anndata/blob/8e793af01a77d0e31e91a72f3988df7d6de9cdc5/anndata/_io/h5ad.py#L58

[2] writeH5AD seems to convert NA_integer_ to -2^31, instead of using a mask as in the spec.

[3] Here is the Python code I used to create krumsiek11_augmented_v0-8.h5ad: https://gist.github.com/jackkamm/3b606d15d83063ed8e5f03ae1c7ab928
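
As a rough illustration of the nullable handling mentioned above: the AnnData 0.8 spec stores nullable integers and booleans as a `values` array plus a boolean `mask` that is true where the value is missing. A minimal base-R sketch of applying such a mask follows; `apply_mask` is a hypothetical name and the vectors are made-up data, not the PR's implementation.

```r
# Hypothetical helper: turn a values/mask pair into an R vector with NAs.
apply_mask <- function(values, mask) {
    # Assumption from the spec: mask is TRUE (or 1) where the value is missing.
    values[as.logical(mask)] <- NA
    values
}

dummy_int  <- apply_mask(c(10L, 20L, 30L), c(FALSE, TRUE, FALSE))
dummy_bool <- apply_mask(c(TRUE, FALSE, TRUE), c(0, 0, 1))
```

Assigning `NA` this way keeps the vector's original type, so the integer column stays integer and the logical column stays logical.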

lazappi

Overall I'm pretty happy with this. I have made a few comments but they are mostly questions. I think this is a fairly significant contribution so I would be happy for you to add yourself as a contributor in the DESCRIPTION if you like.

@LTLA @ivirshup do you have any further comments?

lazappi · 2023-02-20T09:56:33Z

R/read.R

@@ -289,16 +298,57 @@ readH5AD <- function(file, X_name = NULL, use_hdf5 = FALSE,
    mat
 }

+.read_convert <- function(file, path, recursive=FALSE) {


Do we think we would ever want to do this non-recursively? Not a big deal, just wondering if the argument is needed (but I think it's fine to leave it).

lazappi · 2023-02-20T10:01:04Z

R/read.R

-            rhdf5::h5read(file, file.path(path, col_name))
-        )
+        out_cols[[col_name]] <- .read_convert(file, file.path(path, col_name),
+                                              recursive=FALSE)


Ah, I see we use recursive=FALSE here, can probably ignore the earlier comment then

lazappi · 2023-02-20T10:04:02Z

tests/testthat/test-read.R

+    # check colData columns that Python reader is able to handle
+    good_coldat_columns <- c('cell_type', 'dummy_bool', 'dummy_int',
+                             'dummy_num', 'dummy_num2')


Is it the NA thing that doesn't work with the Python reader or something else?

lazappi · 2023-02-20T10:04:58Z

tests/testthat/test-read.R

+    expect_equal(colData(sce_r)$dummy_bool2,
+                 c(FALSE, NA, rep(TRUE, 638)))


Ah, I see. This shouldn't be happening, any thoughts on what might be going on here?

lazappi · 2023-02-20T10:06:23Z

tests/testthat/test-read.R

+    # a bug in the python reader?)
+    expect_equal(
+        as.vector(metadata(sce_py)[['dummy_bool']]),
+        metadata(sce_r)[['dummy_bool']]


Hmmmm...not sure. It may be something to do with how {reticulate} converts single values.

lazappi · 2023-02-20T10:12:11Z

> I've added handling for nullable ints and bools, so I believe the implementation satisfies the v0.8 spec now [1], and is ready for a review.
>
> For the test, I found that writeH5AD() wasn't quite consistent with the spec [2], therefore I manually created a separate AnnData v0.8 object in Python [3], which I saved to the git repo.

I think we need to look into what is going on with the Python reader, I'm not sure about the environment business... That can probably be a separate PR though, but if you want to open an issue about it that would be helpful.

> Footnotes:
>
> [1] The spec also mentions string handling, but I didn't test it because AnnData.write always converts strings to factors: scverse/anndata@8e793af/anndata/_io/h5ad.py#L58

I don't think it always happens, but yes, most of the time it does.

> [2] writeH5AD seems to convert NA_integer_ to -2^31, instead of using a mask as in the spec.

Another thing to check...

> [3] Here is the Python code I used to create krumsiek11_augmented_v0-8.h5ad: gist.github.com/jackkamm/3b606d15d83063ed8e5f03ae1c7ab928

This should be added to inst/scripts as a description of the dataset. If you could add a comment with the important package versions that would be great as well.

jackkamm · 2023-02-21T04:29:10Z

Thanks for the helpful reviews :) I might be a little slow getting to it due to some other deadlines right now, but hopefully should respond & have this done later next week

jackkamm · 2023-03-05T22:04:46Z

I've revised this PR now based on the feedback. I also squashed all the commits and force pushed.

Biggest changes are:

- Use the encoding-type attribute during conversion. Also pass in the path to the recursive conversion function in order to do this.
- Pass version along to internal functions as @LTLA recommended earlier. If version < 0.8, then behavior of native reader is unchanged from before.
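
A dispatch on the encoding-type attribute, as in the first item above, could be sketched like this. It is an illustration under assumptions rather than the PR's `.read_convert`: `convert_element` is a hypothetical name, each element is mocked as a list with `attrs` and `data` instead of being read through rhdf5, and the attribute values ("categorical", "nullable-integer", "nullable-boolean") are the ones named in the AnnData 0.8 spec.

```r
# Hypothetical sketch: choose a conversion based on the encoding-type attribute.
convert_element <- function(element) {
    enc <- element$attrs[["encoding-type"]]
    data <- element$data
    if (is.null(enc)) {
        return(data)  # no attribute available: leave the data untouched
    }
    switch(enc,
        "categorical" = {
            codes <- as.integer(data$codes)
            codes[codes < 0L] <- NA_integer_   # -1 marks a missing value
            factor(data$categories[codes + 1L], levels = data$categories)
        },
        "nullable-integer" = ,                 # falls through to the next case
        "nullable-boolean" = {
            values <- data$values
            values[as.logical(data$mask)] <- NA
            values
        },
        data  # any other encoding type is passed through unchanged
    )
}

# Mock elements (made-up data).
cat_el <- list(attrs = list(`encoding-type` = "categorical"),
               data = list(codes = c(1L, 0L, -1L), categories = c("x", "y")))
int_el <- list(attrs = list(`encoding-type` = "nullable-integer"),
               data = list(values = c(5L, 6L), mask = c(FALSE, TRUE)))
plain_el <- list(attrs = list(), data = c(1, 2))
```

Keying on the attribute rather than on the names found inside a group avoids guessing when a user-defined column happens to be called `codes` or `mask`.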

Also in regard to this:

> I don't think it [string to factor conversion] always happens, but yes, most of the time it does.

Seems like the conversion doesn't happen when the string values are all unique. I added such a column of unique strings to the test data. rhdf5 seems able to read it without any issue.

jackkamm · 2023-03-05T22:07:09Z

Another minor comment I forgot to add:

> [2] writeH5AD seems to convert NA_integer_ to -2^31, instead of using a mask as in the spec.
>
> Another thing to check...

This might be a relatively minor issue. When I tested it before, rhdf5 and h5py both converted the -2^31 to NA/nan when reading it into R/python, so users may not notice the issue in practice.

lazappi · 2023-03-06T08:47:27Z

R/read.R

+    version <- match.arg(version, .AnnDataVersions)
+


I'm not sure if this is the right thing to do. The current .AnnDataVersions records the versions there are {basilisk} environments for but that might not match up with all the possible Python versions. It probably depends a bit what this is used for.

jackkamm · 2023-03-07

Agreed. I changed it now to:

if (is.null(version)) { version <- .AnnDataVersions[1] }

so when it's null it'll match the default python version, but otherwise isn't constrained.
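
The difference between the two approaches can be shown with a small sketch. `known_versions` is a hypothetical stand-in for `.AnnDataVersions` (the values are made up), and `resolve_version` is an illustrative name, not a function in the package.

```r
# Hypothetical stand-in for .AnnDataVersions; the values are illustrative.
known_versions <- c("0.8.0", "0.7.6")

# Default to the first known version when NULL, otherwise accept anything.
resolve_version <- function(version = NULL) {
    if (is.null(version)) {
        version <- known_versions[1]
    }
    version
}

default_version <- resolve_version()
custom_version <- resolve_version("0.9.1")

# match.arg(), by contrast, rejects values outside the known set.
strict <- try(match.arg("0.9.1", known_versions), silent = TRUE)
```

With `match.arg()`, a file written by a version that has no matching environment would raise an error, whereas the NULL check only supplies a default.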

lazappi · 2023-03-06T08:51:13Z

R/read.R

+    # Should we wrap this in suppressWarnings? rhdf5 will warn that it
+    # can't yet read enum (factor/boolean) in attributes
+    element_attrs <- rhdf5::h5readAttributes(file, path)


Hmmm...maybe. If it's something we handle so the warning isn't meaningful to the user then I guess that would make sense. There's just the risk that we suppress a future warning that it would be useful to know about.
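
One way to reduce that risk is to muffle only warnings whose message matches a known pattern, and let everything else through. The following is a minimal sketch using base R's `withCallingHandlers()`; `muffle_matching` is a hypothetical helper, and the warning texts are made up because the exact rhdf5 message is not quoted in this thread.

```r
# Hypothetical helper: suppress only warnings whose message matches `pattern`.
muffle_matching <- function(expr, pattern) {
    withCallingHandlers(
        expr,
        warning = function(w) {
            if (grepl(pattern, conditionMessage(w))) {
                invokeRestart("muffleWarning")
            }
        }
    )
}

# Stand-in for a call that emits one expected and one unexpected warning.
noisy_read <- function() {
    warning("enum type not supported")     # made-up message
    warning("something else went wrong")   # should still reach the user
    "done"
}

# Only the second warning is reported to the caller.
result <- muffle_matching(noisy_read(), "enum")
```

The trade-off is that the pattern has to track the wording of the upstream message, so it would need updating if that wording changes.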

lazappi · 2023-03-06T08:53:03Z

R/read.R

+        # Can't determine orderedness due to rhdf5 not yet supporting
+        # enums in attributes
+        #ord <- as.logical(element_attrs[["ordered"]])


Ah, I guess this is where the warning comes in. If we aren't really addressing it maybe we should leave it (and try to fix this upstream).

jackkamm · 2023-03-07

Agree it would be good to fix this upstream...

Also, I consolidated the 2 comments into a single comment, to try and make this easier to keep track of in future.

jackkamm · 2023-03-07T14:16:12Z

Thinking about it some more, I may have been confused about the "version" argument, and maybe it was a mistake to try to use it at all.

After all the python AnnData package is able to read in older AnnData just fine without needing to specify the version. So we should too.

As it stands now, the PR won't properly read AnnData v0.7 when version=NULL, because the old-style conversion for categories only happens when we explicitly specify a version < 0.8.

I'm pretty sure if I remove the compareVersion calls, the PR should be compatible with both old and new AnnData versions automatically.

Let me know if you want me to revert the explicit version handling and I'll update the PR accordingly.

jackkamm · 2023-03-07T14:26:30Z

R/read.R

+            )
+            out_cols[[cat_name]] <- factor(out_cols[[cat_name]])
+            levels(out_cols[[cat_name]]) <- levels
+        }


This is the problem I mentioned in my comment just now. AnnData 0.7 categories are only converted if version explicitly specified.

lazappi · 2023-03-07T16:16:10Z

So, in the Python reader the version is there to control which environment is used (and therefore which AnnData file version is read/written). In the R reader we aren't messing around with environments so it maybe isn't needed. It just depends whether it's easier to detect what file version has been used or ask the user to specify it.

If we were looking at a writer it would be a bit different because in that case we would want to give the user control over which file version is written.

jackkamm · 2023-03-13T00:12:05Z

I made 1 more commit, so that the version isn't passed into native R reader anymore. Instead, native R reader just tries to figure out the right thing to do based on the keywords/attributes it sees, like in the original version of this PR.

I think it's probably better this way, so the user doesn't need to explicitly specify the version.

But I can see the argument the other way also. So I leave this last change as a separate unsquashed commit -- feel free to revert it if you prefer the previous approach that explicitly passes in the version.

lazappi · 2023-03-15T14:28:15Z

Thanks for all your work on this! I suspect it might need some tweaks in the future but I have merged what we have so far.

lazappi requested changes Feb 20, 2023

jackkamm force-pushed the master branch 2 times, most recently from a2e8183 to f639a90 Compare March 5, 2023 21:53

jackkamm mentioned this pull request Mar 5, 2023

Problem with missing values in AnnData 0.8.0 #87

Closed

lazappi reviewed Mar 6, 2023

Add native reader support for AnnData v0.8.0

843fbc2

jackkamm force-pushed the master branch from f639a90 to 843fbc2 Compare March 7, 2023 03:31

jackkamm commented Mar 7, 2023

Remove need to explicitly specify older version in native R reader

0e91e52

lazappi merged commit 69b1c92 into theislab:master Mar 15, 2023
